DOCUMENT RESUME

ED 126 135    95    TM 005 391

AUTHOR          Horst, Donald P.; Tallmadge, G. Kasten
TITLE           A Procedural Guide for Validating Achievement Gains
                in Educational Projects. Monograph Series on
                Evaluation in Education, No. 2.
INSTITUTION     RMC Research Corp., Los Altos, Calif.
SPONS AGENCY    Office of Education (DHEW), Washington, D.C.
PUB DATE        76
CONTRACT        OEC-0-73-6662
NOTE            103p.; For a related document, see ED 096 30^
AVAILABLE FROM  Superintendent of Documents, U.S. Government Printing
                Office, Washington, D.C. 20402 ($2.10)
EDRS PRICE      MF-$0.83 HC-$6.01 Plus Postage.
DESCRIPTORS     Academic Achievement; Achievement Gains;
                Compensatory Education Programs; Control Groups;
                Criteria; Criterion Referenced Tests; Data
                Collection; Decision Making; *Demonstration Projects;
                Educational Programs; Grade Equivalent Scores;
                *Guides; Mathematical Models; Measurement Techniques;
                Models; Norm Referenced Tests; Norms; *Program
                Effectiveness; *Program Evaluation; Research Design;
                Research Problems; Selection; Standardized Tests;
                Statistical Analysis; Tests of Significance;
                *Validity
IDENTIFIERS     Percentile Norms



ABSTRACT

The orientation of this report is that of identifying
educational projects which can be considered clearly exemplary. The
largest section consists of a 22-step procedure for validating the
effectiveness of educational projects using existing evaluation data.
It is not intended as a guide for conducting evaluations but rather
for interpreting data assembled by others using a wide variety of
experimental and quasi-experimental designs. As such, its coverage is
not restricted to "good" designs. It encompasses all of the commonly
employed evaluation models, but is not so much concerned with
assessing the relative usefulness of various designs as with the
deficiencies and hazards inherent in each of them. It also offers
suggestions for correcting those results when certain measurement or
analysis principles have been violated. Included as appendices are a
discussion of the issues surrounding use of criterion- versus
norm-referenced tests, description of the logic and mathematical
structures of certain regression models, and an overview of the
hazards associated with the use of percentiles and grade equivalent
scores to describe academic performance. (Author/B2P)



Documents acquired by ERIC include many informal unpublished materials not available from other sources. ERIC makes every
effort to obtain the best copy available. Nevertheless, items of marginal reproducibility are often encountered and this affects the
quality of the microfiche and hardcopy reproductions ERIC makes available via the ERIC Document Reproduction Service (EDRS).
EDRS is not responsible for the quality of the original document. Reproductions supplied by EDRS are the best that can be made from
the original.






A PROCEDURAL 

GUIDE FOR 

VALIDATING 

ACHIEVEMENT 

GAINS IN 

EDUCATIONAL 

PROJECTS 




Number 2 
in a Series of 
Monographs on 
Evaluation in
Education 






U.S. DEPARTMENT OF HEALTH, EDUCATION, AND WELFARE

David Mathews, Secretary
Virginia Y. Trotter, Assistant Secretary for Education

Office of Education

T.H. Bell, Commissioner 



The research reported herein was performed pursuant to a contract
with the Office of Education, U.S. Department of Health, Education,
and Welfare. Contractors undertaking such projects under
Government sponsorship are encouraged to express freely their
professional judgment in the conduct of the project. Points of
view or opinions stated do not, therefore, necessarily represent
official Office of Education position or policy.



U.S. GOVERNMENT PRINTING OFFICE
WASHINGTON: 1976






This is the second in the Office of Education's series
of evaluation handbooks. It complements the first*
by approaching the problem from a different viewpoint,
that of the interested party reviewing evaluation
results and selecting exemplary projects based on them.
Written by G. Kasten Tallmadge and Donald P. Horst of
the Mountain View, California office of RMC Research
Corporation, it is a product of contract OEC-0-73-6662
entitled, "Planning Study for the Development of
Project Information Packages for Effective Approaches to
Compensatory Education."

Review and appraisal of an evaluation's procedures are
presented in a series of steps. The handbook thus leads
the reader systematically to a judgment of whether or
not the evaluation's results are valid. It also offers
suggestions for correcting those results when certain
measurement or analysis principles have been violated.
Included as appendices are sample project summary
worksheets, a discussion of the issues surrounding use of
criterion- versus norm-referenced tests, description
of the logic and mathematical structures of certain
regression models, and an overview of the hazards
associated with the use of percentiles and grade equivalent
scores to describe children's academic performance.
Other handbooks forthcoming in the series organized
by the Office of Planning, Budgeting, and Evaluation
will discuss procedures for using criterion-referenced
tests in evaluation, for assessing children's affective
growth, for estimating standard replicable project costs,
for evaluating non-instructional project components,
etc.



Janice K. Anderson 

Office of Planning, Budgeting, 

and Evaluation 
U.S. Office of Education 



* A Practical Guide to Measuring Project Impact on Student
Achievement, Donald P. Horst, G. Kasten Tallmadge, and
Christine Wood, RMC Research Corporation, Mountain
View, California, 1975. Government Printing Office
Stock Number 017-080-01460, $1.90.






The present version of this document is the second major revision of
a report first published in October, 1973. Both the original report and
the first revision owed much to Edward B. Glassman of the U.S. Office of
Education's Office of Planning, Budgeting, and Evaluation (OPBE). As
Project Officer for the contract under which the report was developed, he
deserves credit for originally recognizing the need for such a guidebook.
The authors are also indebted to him for his many thoughtful suggestions
and for those he solicited from his professional colleagues throughout the
original writing and revision processes.

The present revision owes its origin to Janice K. Anderson, also of
OPBE, who has taken on responsibility for the entire series of Monographs
on Evaluation in Education. We are indebted to her for her many good ideas
and for the encouragement she provided in getting us to think through the
issues one more time.

We are also indebted to Paul Horst for the very great assistance he
provided with Appendix C of the report and for his frequent additional
comments and suggestions.

Many other members of the RMC Research Corporation staff also helped
in various capacities, and we are most grateful to all of them.

G. Kasten Tallmadge
Donald P. Horst






TABLE OF CONTENTS 

FOREWORD

ACKNOWLEDGMENTS

TABLE OF CONTENTS

LIST OF TABLES AND FIGURES

I. INTRODUCTION

II. PRELIMINARY SCREENING OF CANDIDATE PROJECTS

III. EVALUATING PROJECT EFFECTIVENESS

IV. DECISION TREE FOR VALIDATING STATISTICAL SIGNIFICANCE

V. ADDITIONAL CONSIDERATIONS

APPENDIX A
     Project Selection Criteria Worksheets

APPENDIX B
     Norm-referenced versus Criterion-referenced Tests

APPENDIX C
     Estimation of Treatment Effects from the Performance
     of Non-comparable Control Groups

APPENDIX D
     Hazards Associated with the Use of Percentiles and
     Grade-equivalent Scores

REFERENCES






LIST OF TABLES AND FIGURES

Table

1    Monthly Grade-equivalent Gains in Reading at the
     22nd Percentile on Tests with Two Empirical Normative
     Data Points

2    Monthly Grade-equivalent Gains in Reading at the
     22nd Percentile on Tests with One Empirical Normative
     Data Point

3    Mean Reading Comprehension Scores for Two Hypothetical
     Students on the Comprehensive Tests of Basic Skills
     (Form a)

Figure

1    Decision tree for validating statistical significance

2    Regression Projection Model

3    Cognitive growth shown by the test publisher's median
     versus a more realistic expectation

4    Publisher's percentiles corresponding to the "real"
     median in Figure 3 at the beginning and end of each
     norming period

5    Comparison of the median score with the grade norm
     line

6    Hypothetical relationships between grade-equivalent
     score and reading skill






I. INTRODUCTION



This report was developed in conjunction with Contract No.
OEC-0-73-6662 entitled, "The Development of Project Information Packages
for Effective Approaches in Compensatory Education." As its name
implies, the contract effort was primarily focused on packaging concepts
and procedures which would facilitate the replication of sound
educational practices. There was great concern, however, that the
projects selected for replication should indeed be exemplary in
producing significant cognitive achievement benefits.

Because the selection process was to be based on existing data
derived from a wide variety of experimental and quasi-experimental
evaluation designs, it was necessary not only to establish criteria
for the statistical and educational significance of achievement gains
but also to define procedures for verifying that these criteria were
met. This latter task was not regarded lightly, but it was, the
authors felt, something which could be accomplished in a
straightforward manner by borrowing liberally from the work of Campbell
and Stanley (1963) and others. It did not seem likely that much original
work would be required, or that this report would contain any
significant information not already present in widely read evaluation
texts. These initial impressions, however, were quickly to be rejected.

It was not long after work on the validation procedure began
that it became necessary to put aside the well documented issues of
experimental design and statistical inference and to probe the
netherworld intricacies of achievement test scores and normative data.
Facts which appeared to undermine the validity of inferences drawn
from nearly all locally conducted evaluations quickly came to light
as this exploration proceeded. The problems were so fundamental that
the authors found it hard to believe they were not well known, yet
they were able to find little in the literature which was more than
marginally relevant.

Before they started work on the validation procedure, the authors
considered themselves reasonably sophisticated in both the theory and
practice of educational evaluation. There were, however, a number of
details which had escaped their attention. They were not aware, for
example, that a child scoring in the lowest quartile of the national
distribution could make gains greater than month-for-month over an
entire school year and end up farther below the norm than he began.
They did not know that a fiftieth-percentile third grader could be
2.5 months below grade level in reading, or that an educational
program could appear highly successful if the pre- to posttest interval
spanned the twelve months from 1 May to 1 May but would resemble an
instructional disaster if pupils obtained the same scores on tests
administered one day earlier.

These outrageous incoherencies were just a few of the "horror
stories" uncovered in the course of routinely examining real-world
evaluation studies. The sad part was that these or similar
irrationalities were so pervasive that not a single evaluation report was
found which could be accepted at face value! Even more disheartening, many
of these evaluations followed procedures officially sanctioned by one
or more presumably authoritative groups of experts.

With each new discovery it became increasingly clear that this re- 
port would have new things to say and would have significant implica- 
tions beyond the scope of the effort which spawned it. For this reason, 
it has undergone several revisions intended to increase its general 
usefulness. One significant change involved removing as much as possible 
of the material which dealt with project selection criteria unrelated 
to cognitive achievement benefits. Discussion of these criteria (cost, 
availability, and replicability) was clearly specific to the contract 
effort and appeared to detract from the usefulness of the report for a 
broader audience. 

While the coverage of the report has changed somewhat from earlier
versions, its format remains the same. The largest section of the
report consists of a 22-step procedure for validating the
effectiveness of educational projects using existing evaluation data. It is
not intended as a guide for conducting evaluations but rather for
interpreting data assembled by others using a wide variety of
experimental and quasi-experimental designs. As such, its coverage is not
restricted to "good" designs. It encompasses all of the commonly
employed evaluation models, but is not so much concerned with
assessing the relative usefulness of various designs as with the deficiencies
and hazards inherent in each of them.

One additional point should be mentioned here. The orientation
of this report is that of identifying educational projects which can
be considered clearly exemplary. Unfortunately, in minimizing the
probability of classifying an unsuccessful project as successful, the
decision-tree procedures somewhat increase the probability of
rejecting projects which may really be successful. If the goal were
to identify unsuccessful projects for the purpose of terminating
them rather than successful projects for replication purposes, a
different orientation would be more appropriate.



II. PRELIMINARY SCREENING OF CANDIDATE PROJECTS

The process of selecting and validating exemplary educational
projects is viewed as iterative in nature, with each criterion area
examined at several preliminary levels before analysis is undertaken
at the depth which will ultimately be required. The specific steps
to be taken and the criteria to be used will vary as a function of each
study's particular objectives. The variations, however, should not
represent major departures from the general strategy which was employed
in selecting exemplary compensatory education projects for packaging.
This strategy is described below.

The process began with defining the population from which projects
were to be drawn, assembling a list of candidate projects, and
soliciting available documentation from each of them. When these tasks were
completed, the investigators had in their possession an incomplete
collection of reports, data, and promotional literature on each
candidate project.

Winnowing this information, identifying and obtaining needed
supplementary data, and weighing the resulting evidence was a complex
task. It required a substantial investment of effort including mail
and telephone communication with project personnel and usually at least
one site visit. Typically, it was not feasible to apply the entire
process to all candidate projects, and some preliminary screening
procedures were required. Projects which passed the preliminary screening
criteria were considered "possible" candidates for validation, and all
criterion areas were then systematically investigated in greater depth.
When there was doubt as to whether or not a project had met one of the
preliminary criteria, the project was not rejected immediately. Rather,
attention was focused on the specific criterion in question so that
definitely unsuitable projects could be identified and rejected with
a minimum of superfluous effort.




Appendix A contains a set of worksheets which were developed to
facilitate the preliminary screening of compensatory education projects
which were candidates for exemplary status. While the specific
criteria applied to this screening effort may not be widely applicable
without modification, the worksheets should serve as useful models
for any similar types of screening.

The first worksheet was completed for every candidate project
and provided a record of the disposition of the project. The first
two sections, "Description" and "Prerequisites," were completed as
the first step in processing information received from a project.
Information under these headings served to verify that the candidate
project did indeed come from the population being considered. The
third heading, "Final Assessment," was used later to summarize the
results of the investigations in each of the four major criterion areas.

The second worksheet, "Preliminary Screening Criteria," comprises a
checklist which was used for all projects which met the prerequisites.
Checks were made whenever it was possible to determine that a criterion
had been met. Conversely, if it could be determined that one of the
criteria was not met, the project was immediately rejected and no
effort was spent examining other areas. Where doubt existed, efforts
were focused on the questionable areas one at a time until either it
was determined that all criteria were met or the project was rejected.
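The checklist logic just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
names `Status` and `screen` are hypothetical, and the four criterion
names in the example are taken from the criterion areas mentioned in
this report (effectiveness, cost, availability, replicability), not
from the actual worksheet in Appendix A.

```python
from enum import Enum


class Status(Enum):
    MET = "met"
    NOT_MET = "not met"
    IN_DOUBT = "in doubt"


def screen(checklist):
    """Return (disposition, criterion of interest) for one project.

    checklist maps a criterion name to a Status.
    """
    # A single criterion known to be unmet rejects the project at once.
    for name, status in checklist.items():
        if status is Status.NOT_MET:
            return "rejected", name
    # Otherwise attention goes to the first criterion still in doubt.
    for name, status in checklist.items():
        if status is Status.IN_DOUBT:
            return "in doubt", name
    # Every criterion has been checked off.
    return "passed", None


# Illustrative checklist (hypothetical project, hypothetical statuses).
example = {
    "effectiveness": Status.MET,
    "cost": Status.IN_DOUBT,
    "availability": Status.MET,
    "replicability": Status.IN_DOUBT,
}
print(screen(example))  # ('in doubt', 'cost')
```

Calling `screen` again after each doubtful criterion has been resolved
mirrors the one-at-a-time focus on questionable areas.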

The third worksheet, entitled "Analysis of Project Evaluation,"
was used to describe the tryout design in such a way as to summarize
the evidence of effectiveness and provide a context for its
interpretation.

The use of forms such as those included in Appendix A for
summarizing and recording preliminary screening information may give the
misleading impression that the screening process is quite rigorous.
In fact, it is no more than a coarse grouping procedure whereby
educational projects are categorized as (a) apparently meeting the
selection criteria, (b) apparently not meeting the selection criteria,
or (c) can't tell. Even the distinction among these groups is not at
all clear-cut in the effectiveness area, where misuse of experimental
designs and statistical procedures is quite common and affects results
in ways that are not easily decipherable.

It was decided that the detailed validation procedures would be
applied solely to projects which appeared, on the basis of preliminary
screenings, to meet the selection criteria. Only if the number of
such projects which survived validation was inadequate would it be
necessary to dip into the "can't tell" category. At that point,
validation procedures would be applied to those projects which the
investigators felt were most promising based on whatever
circumstantial evidence they could assemble.

This process would continue, one project at a time, until either
the "quota" was filled or until it became clear that the original
classification had been excessively optimistic and that the probability
of finding additional successes was so remote as to suggest abandoning
the search.
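The order of work described above can be sketched as follows. This is
a rough illustration, not the procedure actually used: the function
name `select_projects` is hypothetical, and the `give_up_after`
parameter is an assumed stand-in for what the text treats as a matter
of judgment (the point at which further search looks futile).

```python
def select_projects(apparently_meeting, cant_tell, validate, quota,
                    give_up_after=3):
    """Validate candidates until the quota is filled or the search is dropped.

    apparently_meeting: projects that appeared to meet the criteria.
    cant_tell: doubtful projects, ordered most promising first.
    validate: callable returning True if a project survives validation.
    give_up_after: assumed number of consecutive failures among the
        "can't tell" projects after which the search is abandoned.
    """
    # Detailed validation goes first to the apparently qualifying projects.
    validated = [p for p in apparently_meeting if validate(p)]

    # Dip into the "can't tell" group only if too few survived.
    failures = 0
    for project in cant_tell:
        if len(validated) >= quota or failures >= give_up_after:
            break
        if validate(project):
            validated.append(project)
            failures = 0
        else:
            failures += 1
    return validated


# Toy run: only projects "A" and "D" survive validation (made-up data).
survivors = {"A", "D"}
result = select_projects(["A", "B"], ["C", "D", "E"],
                         lambda p: p in survivors, quota=2)
print(result)  # ['A', 'D']
```

In practice the stopping decision would rest on the investigators'
judgment rather than on a fixed count.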



III. EVALUATING PROJECT EFFECTIVENESS



Assessing effectiveness of an educational project presents an
intrinsically difficult problem. The evaluator faces many pitfalls
which may be broadly categorized as relating to measurement, experimental
design, or statistics. Hazards exist in each of these areas which may
completely invalidate any inferences which might be drawn from the data
regarding project impact.

Conventions for experimental design and associated statistics have
been developed to deal effectively with evaluation problems in controlled
experimental settings. Standard reference books describing these
conventions are widely available (e.g., Winer, 1971) and are well known
to most evaluation specialists. Unfortunately, in the real world of
education it is often impossible to employ rigorous techniques, and it
is extremely rare to find a compensatory education project which
satisfies all, or even most, of the fundamental principles of good research
design. The problem is so widespread, in fact, that if one were to
reject all projects with less-than-ideal evaluations, the possibility
of finding even a few exemplary projects would be extremely remote.

Many of the weaker designs have been discussed at length by
Campbell and Stanley (1963) along with the "threats to internal and
external validity" associated with each. These authors, however, have
hardly touched upon the equally important and related problems of
educational measurement. Scoring, scaling, and norming considerations
are fundamental to all educational evaluations and are particularly
critical to those designs which employ non-equivalent comparison groups
or no comparison group at all.

The extent and complexity of the experimental and measurement
problems inherent in evaluation call for a systematic procedure for
reviewing project evaluations, for identifying and assessing the impact
of their shortcomings, and for making reasonable judgments regarding
project effectiveness while carefully weighing relevant factors.
To meet this need, a 22-step decision tree was developed. The decision
tree was designed to insure appropriate consideration, in any evaluation
study, of each of the threats to valid inference discussed by Campbell
and Stanley (1963) relevant to the specific design employed. It also
focuses attention on problems related to whether comparisons are made
against control groups or are norm-referenced, the type of scores on
which statistical operations are performed (raw, standard, scale,
percentile, grade-equivalent), and the bases on which treatment-control
(or norm group) comparisons are made (posttest scores, adjusted posttest
scores, gain scores, etc.).

A procedure of this type cannot, of course, be applied in a vacuum.
It must be tied to pre-established criteria to which each judgment can
be related. Principal among these criteria are (a) the minimum
increment of cognitive benefit which will be considered educationally
significant and (b) the minimum non-chance probability level which will be
accepted as statistically significant.

It should be pointed out that the establishment of criteria for
educational and even statistical significance is a matter of policy
decision-making and has only tenuous ties to "science." While it is
clear, for example, that the goal of compensatory education is to raise
the achievement levels of disadvantaged children from some starting
point to an end point which is closer to the national norm, the amount
of gain required for projects to be considered exemplary is almost
entirely a matter of opinion. The only scientific issue is that of
selecting or developing a suitable metric for quantifying the cognitive
gain criterion.

The use of grade-equivalent scores has appeared to offer a
convenient solution to the problem. It is intuitively logical that, regardless
of how far below the national norm a child may be, if he makes gains
which are greater than month-for-month he will improve his status. It
is also intuitively logical that if he makes gains which are less than
month-for-month, he will fall farther behind the national norm. Thus
it has been common practice to assess cognitive growth in terms of
grade-equivalent gains per month of project exposure and to accept gains
equal to or greater than month-for-month as educationally significant.
Unfortunately, this convention is fundamentally unsound and often leads
to incorrect inferences about the impact of special instructional
projects.

Because cognitive growth is not a linear function of time either
between or within years, because test publishers do not collect enough
normative data to construct more meaningful raw-to-grade-equivalent
score conversion tables, and because a lot of interpolation,
extrapolation, and curve-smoothing is always involved, grade-equivalent scores
simply do not behave in a fashion which is consistent with intuitive
or logical expectations. These and other technical problems associated
with grade-equivalent scores and grade-equivalent gains are discussed
in detail later in this report, and examples of some of the incoherencies
which actually occur in real-world situations were presented in the
Introduction. Here it is sufficient simply to say that such scores do
not provide a suitable medium for measuring the achievement gains that
may result from compensatory education projects.

Even if grade-equivalent scores possessed the characteristics which
they are typically presumed to have, the month-for-month measure of
effectiveness would be deficient in that it would systematically
discriminate against projects serving those most in need of compensatory
programs. This systematic bias stems from the all-but-trivial fact
that increasing an achievement growth rate from 0.8 to 1.0
month-per-month requires less remediation than raising one from 0.7 to 1.0. A
more equitable measure would be one which is independent of the initial
degree of disadvantagement of the children being served.

In order to be independent of initial achievement status, any
measure of gain must be expressed in terms of an equal-interval scale,
i.e., the units of the scale must be the same size over the entire range
of scale values so that a gain of five points represents exactly the
same amount of cognitive growth regardless of whether it occurs at the
low end of the scale, the middle of the scale, or the high end.
Normalized standard scores comprise such a scale and thus provide an
equitable metric for quantifying gains. There is another problem in
quantifying gains, however, which relates to the non-comparability of
information derived from one scale of normalized standard scores with
that derived from another.

A standard score is simply the difference between a particular
"observed" score and the mean score of the total group tested, expressed
in standard deviation units:

$$\text{standard score} = \frac{X - \bar{X}}{s_X}$$

where $X$ is the observed score, $\bar{X}$ is the mean of the group
tested, and $s_X$ is that group's standard deviation.
As such, its value (size) depends on both the mean and the standard
deviation of the particular group which was tested. Different groups,
of course, can be expected to have different mean scores and different
standard deviations; thus there will be no comparability between
standard scores or standard score gains from group to group.

To solve the comparability problem, it is only necessary to use
standard scores which are referenced to the mean and standard deviation
of a nationally representative sample rather than the values derived
from the particular group tested. If, for example, several different
groups of children were tested at the beginning of third grade, the
scores of each child would be expressed as deviations from the national
average for beginning third graders divided by the standard deviation
of the national population of these children. Scores derived in this
way would provide a suitable metric for quantifying gains and would
also enable equitable comparisons of gains to be made among projects
serving children with different degrees of initial disadvantagement.
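The comparability point can be illustrated with a short Python sketch.
The helper name `standard_score`, the national mean and standard
deviation, and the two groups' raw scores are all invented for
illustration; they are assumptions, not values from any published norms.

```python
from statistics import mean, pstdev


def standard_score(observed, ref_mean, ref_sd):
    """Deviation of a score from a reference mean, in reference SD units."""
    if ref_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    return (observed - ref_mean) / ref_sd


# Hypothetical national norms for beginning third graders.
NATIONAL_MEAN = 40.0
NATIONAL_SD = 10.0

# Hypothetical raw scores for two project groups.
groups = {
    "A": [22.0, 28.0, 34.0],
    "B": [32.0, 38.0, 44.0],
}

results = {}
for name, scores in groups.items():
    group_mean = mean(scores)
    # Referenced to the group itself, every group's mean scores zero,
    # so the two groups cannot be told apart.
    local = standard_score(group_mean, group_mean, pstdev(scores))
    # Referenced to the national norm, the groups' standings differ.
    national = standard_score(group_mean, NATIONAL_MEAN, NATIONAL_SD)
    results[name] = (local, round(national, 2))

print(results)
```

With these made-up numbers both groups have a locally referenced mean
of zero, while the nationally referenced means place group A further
below the norm than group B.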

These considerations led the authors to advocate the use of
standard score gains referenced to the national norm as the medium in which
to cast whatever definition of educational significance might be
decided upon. Subsequently, a gain of one-third standard deviation
with respect to the national norm was chosen (on somewhat arbitrary
grounds) as the criterion to be used in the national packaging effort
for determining exemplary status. In that study, for a project to be
considered for packaging, the mean posttest standard score of project
participants had to be one-third standard deviation higher with respect
to the national norm than the mean pretest score of the same children.
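A minimal sketch of the one-third standard deviation check follows. It
covers educational significance only and says nothing about statistical
significance. The function name `meets_gain_criterion`, the norm values,
and the pupils' scores are hypothetical; the sketch assumes that a
national mean and standard deviation are available for both the pretest
and the posttest dates.

```python
from statistics import mean

CRITERION = 1 / 3  # required gain, in national standard deviation units


def meets_gain_criterion(pre_scores, post_scores, pre_norm, post_norm,
                         criterion=CRITERION):
    """Return (gain, passed) for one group of pupils tested twice.

    pre_norm and post_norm are (national mean, national SD) pairs for
    the pretest and posttest dates respectively.
    """
    pre_z = (mean(pre_scores) - pre_norm[0]) / pre_norm[1]
    post_z = (mean(post_scores) - post_norm[0]) / post_norm[1]
    gain = post_z - pre_z
    return gain, gain >= criterion


# Hypothetical data: the group starts 1.0 SD below the national mean
# and ends 0.6 SD below it, a gain of about 0.4 SD.
gain, passed = meets_gain_criterion(
    pre_scores=[28, 30, 32],
    post_scores=[42, 44, 46],
    pre_norm=(40, 10),
    post_norm=(50, 10),
)
print(round(gain, 2), passed)
```

Note that the group's raw scores rise in either case; what the check
asks is whether its standing relative to the national norm improved by
the chosen amount.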

Criteria for educationally significant gains will vary as a
function of each study's objectives. The 22-step decision tree was developed
so as not to be irrevocably tied to either standard scores or to gains
of one-third standard deviation. It is both more general and more
permissive than the specific criteria which were adopted for selecting
exemplary projects under Contract No. OEC-0-73-6662. It is, in fact,
independent of any specific criterion.

Many, if not most, of the steps in the decision tree explicitly call
for judgments from the evaluator. At each step it is assumed that the
evaluator is thoroughly familiar with the issues involved and is
qualified to make a judgment based on complex technical considerations. Each
decision-tree step is accompanied by a discussion which is intended to
define the question that is to be answered, but little or no attempt is
made to explain the underlying problems. Such explanations are included
in separate appendices in instances where commonly accepted principles
or practices are discredited and where new or unusual approaches are
endorsed.

It is assumed that the evaluator is familiar with the relevant
statistical tools and will apply them appropriately in making his
decisions. For this reason, standard statistical procedures are discussed
briefly, if at all. More importantly, it should be pointed out that
educational evaluation is, and probably will continue to be, an inexact
science. Even where the most powerful designs are used, it will be
possible to generate plausible hypotheses attributing the observed
results to some influence other than the instructional treatment or to
factors unique to the tryout site in question. Where weaker designs are
employed, it will be highly desirable, or even essential, to strengthen
the validity of inferences regarding project effectiveness by amassing
as much supporting evidence as possible. In any case, consistency of
findings across several replications of an evaluation study would
constitute the most convincing kind of supporting evidence.






Figure 1, on pages 46 and 47, summarizes the 22-step decision tree
in flow-diagram form. Each step is discussed separately on the pages
preceding Figure 1.

The particular path to be followed through the decision tree depends,
of course, on the specific design employed in the evaluation study under
consideration, but each path is structured so as to focus attention on
the design, analysis, and interpretation pitfalls likely to be encountered
using that model. Unless a project has been evaluated in several
different ways, substantially fewer steps will be required than the 22 which
comprise the entire decision tree. Page 2 of Worksheet III, Appendix A,
was designed for summarizing the considerations at each point in the
decision tree and for recording whatever relevant judgments are made.

One additional comment which should be made with respect to the
decision tree relates to the fact that it has a number of exit points
labeled REJECT. The intent of these exit points is never that the project
be rejected as unsuccessful. What is rejected is not the project but the
evaluation data which, if the decision-tree process has been carefully
followed, have been shown to be inadequate as a basis for reaching any
conclusion with respect to the success or failure of the project.
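One way to keep the meaning of a REJECT exit explicit is to model each
step as a yes/no question with two successors. The Python sketch below
is an assumption-laden illustration: the names `Step`, `walk`, and
`interpret` are hypothetical, and only Step 1 (as worded in Section IV)
is filled in, so the walk stops as soon as it reaches a step outside
this fragment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

REJECT = "REJECT"  # exit label: the evaluation data are set aside


@dataclass
class Step:
    question: str
    on_yes: Union[int, str]  # next step number, or an exit label
    on_no: Union[int, str]


def walk(tree: Dict[int, Step], judge: Callable[[int, str], bool],
         start: int = 1):
    """Follow the evaluator's judgments; return (exit label, path taken)."""
    path = []
    current = start
    while isinstance(current, int):
        if current not in tree:
            # Steps beyond this fragment are not modeled here.
            return f"CONTINUE AT STEP {current}", path
        step = tree[current]
        answer = judge(current, step.question)
        path.append((current, answer))
        current = step.on_yes if answer else step.on_no
    return current, path


def interpret(exit_label: str) -> str:
    """Spell out what an exit does and does not say about the project."""
    if exit_label == REJECT:
        return "data inadequate: no conclusion about project success or failure"
    return "judgment pending: further steps required"


# Fragment containing only Step 1, worded as in Section IV.
tree = {
    1: Step("Are the test instruments adequately reliable and valid "
            "for the population being considered?", on_yes=2, on_no=REJECT),
}

label, path = walk(tree, lambda number, question: False)
print(label, path, interpret(label))
```

The `judge` callable stands in for the evaluator; the sketch automates
nothing about the judgments themselves.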

It should be clear from the above and, indeed, from the decision
tree itself that exacting compliance with the conventions of experimental
design is not generally feasible in real-world educational contexts.
Throughout this report the explicit emphasis given to the subjective
components of the evaluation process constitutes a deliberate attempt
to avoid the misleading impression of algorithmic rigor that might
result if the role of judgment were obscured by rigid procedures,
arbitrary criteria, and dubious tests of statistical significance.






IV. DECISION TREE FOR VALIDATING STATISTICAL SIGNIFICANCE



Step 1

Question  Are the test instruments adequately reliable and valid for
the population being considered?

Yes  Proceed to Step 2

No  Reject test scores as measures of
project success

Comment  The sensitivity of any assessment of instructional impact
will be directly related to the reliability and validity of
the test instruments used. Evaluation designs which depend
on both pre- and posttest scores (e.g., regression models)
are especially dependent on highly reliable and valid
instruments and, when using such designs, these characteristics
should be more heavily weighted in the test selection
process than might be appropriate where conventional
experimental designs are employed.

Even where conventional experimental designs are used and
practical concerns such as testing costs and time will
influence instrument selection, reliability and validity
considerations must not be ignored. It should also be
remembered that the reliability and validity figures cited
in test publishers' manuals may not be appropriate for the
group being tested or under the circumstances involved.
There are several potential problems:

1. The cited reliability coefficients are likely to
be measures of internal consistency (e.g., split-half,
alpha) rather than measures of temporal
stability (e.g., test-retest). While the two
types of reliability estimates tend to be
closely related, there may be significant
differences, and the concern here is the extent
to which the test will yield the same scores on
successive administrations.

2. The cited reliability coefficients are likely to
be too high if the group to be tested represents
only a portion of the grade-level span for which
the test is nominally intended.

3. The cited reliability coefficients are likely to
be too high if the group to be tested is
restricted in its range of ability. Reliabilities
for disadvantaged and gifted groups, for example,
will be lower than reliabilities for representative
groups. A rough reliability estimate for
a treatment group with a restricted range of test
scores (e.g., bottom quartile) may be obtained
from the following formula (Guilford, 1965,
p. 464):

$$ r_t = 1 - \frac{s_n^2}{s_t^2}\,(1 - r_n) $$

where

$r_t$ = reliability for the treatment group

$r_n$ = reliability for the norm group

$s_t$ = treatment group pre- or posttest
standard deviation (whichever is
smaller)

$s_n$ = norm group standard deviation






This formula is based on the assumption that
the standard error of measurement for the
treatment group is equal to the standard error of
measurement for the norm group. If the
experimental group measurement error is actually
higher than that for the norm group, this estimate
of test reliability will be too high (see Stanley,
1971, p. 362).
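
The restricted-range estimate in item 3 can be sketched in a few lines of Python. This is only an illustration of the equal-standard-error assumption stated above, which implies r_t = 1 - (s_n^2 / s_t^2)(1 - r_n). The function name and the sample figures are hypothetical and do not come from the report.

```python
def restricted_range_reliability(r_norm: float, sd_norm: float, sd_treatment: float) -> float:
    """Estimate reliability for a range-restricted group.

    Assumes the standard error of measurement is the same in the norm
    group and the treatment group, as the text requires.
    """
    if sd_norm <= 0 or sd_treatment <= 0:
        raise ValueError("standard deviations must be positive")
    return 1.0 - (sd_norm ** 2 / sd_treatment ** 2) * (1.0 - r_norm)


# Hypothetical figures: published reliability .90, norm-group SD 10,
# treatment-group SD 6 (the smaller of its pre- and posttest SDs).
estimate = restricted_range_reliability(r_norm=0.90, sd_norm=10.0, sd_treatment=6.0)
print(round(estimate, 3))
```

With these made-up inputs the estimate falls well below the published coefficient, which is the direction of error the text warns about.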

Floor effects will further lower reliability for a group in
the lower tail of the distribution, and a judgment must be
made as to the impact of these effects (see Step 2).

It should be kept in mind that test administration and
scoring procedures may have important effects on reliability
and validity. Unless the procedures outlined in the
publisher's test manual are followed closely, the obtained
scores may seriously misrepresent achievement levels. This
problem is particularly acute where the effectiveness of an
instructional project is assessed by means of norm-group
comparisons.







Step 2 



Question  Are pre- or posttest score distributions of any groups
curtailed by ceiling or floor effects?

Yes  Estimate the size of the effect, record
on the worksheet, and proceed to Step 3

No  Proceed to Step 3

Comment  Ideally, the lowest scoring pupil should score above the
chance level on the test and the highest scoring pupil
should score below the maximum possible score. The actual
chance level is difficult to estimate since it depends on
the guessing strategy of each student. For students who
guessed randomly on all items they didn't know, chance
would equal the number of items divided by the number of
response alternatives per item. Students often leave
items blank, however, even when instructed to guess, and
when they do guess, their choices are not necessarily
selected randomly from all available alternatives. Because
of these problems, the most practical way of identifying
floor or ceiling effects is inspection of score
distributions for excessive skewness. If the treatment children
encounter the test floor on pretesting, or the ceiling
on posttesting, their gains will be underestimated. (Gains
would only be overestimated where the ceiling was
encountered on pretesting and/or the floor on posttesting. This
improbable event could occur where different levels of a
test were used for pre- and posttesting but there is
generally enough overlap between levels so that this type of
situation can be avoided.)

If the experimental design employs a control group, it
would be subject to similar estimation errors which would
then need to be considered in combination with those of
the treatment group.

There is no foolproof method of estimating the size of
ceiling or floor effects. In a symmetrical distribution,
however, the mean and median will be equal. Compressing
one end of the distribution will affect the mean but not
the median. The median, then, may provide a reasonable
estimate of where the mean would have been in the absence
of a ceiling or floor effect.
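
The two checks described in this step, the chance level under pure random guessing and the comparison of mean with median, might be sketched as follows. The function names and the score list are hypothetical; the sketch only restates the arithmetic in the text and is not a substitute for inspecting the full score distribution.

```python
import statistics


def chance_score(n_items: int, n_alternatives: int) -> float:
    """Expected raw score for a pupil guessing at random on every item."""
    return n_items / n_alternatives


def floor_ceiling_check(scores):
    """Return the mean, the median, and the difference (mean minus median)."""
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    return {"mean": mean, "median": median, "mean_minus_median": mean - median}


# Hypothetical pretest raw scores on a 40-item, 4-choice test,
# bunched near the floor.
pretest = [9, 10, 10, 10, 11, 11, 12, 13, 15, 19]
summary = floor_ceiling_check(pretest)
at_or_below_chance = sum(s <= chance_score(40, 4) for s in pretest)
print(summary, at_or_below_chance)
```

A mean noticeably above the median, together with several pupils at or below the chance score, would suggest compression at the low end; the size of the mean-median gap is then one rough figure to record on the worksheet.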






Step 3 



Question  Is there reason to believe that the pretesting experience
may have been at least partially responsible for the
observed treatment effect?

Yes  Estimate the size of the effect, record
on the worksheet, and proceed to Step 4

No  Proceed to Step 4

Comment  If standardized tests are used, and the experimental
design employs a control group, the pretesting experience
should have little or no effect on the outcome of the
evaluation. Pretesting with criterion-referenced tests,
however, may sensitize pupils as to what they are expected
to learn. This sensitization may interact differentially
with the learning experience available to treatment and
control pupils so as to produce greater learning of
criterion items in the treatment group.

A more serious problem arises where there is no control
group because, as Campbell and Stanley (1963) point out,
"...students taking the test for the second time, or
taking an alternate form of the test, etc., usually do
better than those taking the test for the first time
[p. 175]." Since, presumably, children in the norm groups
took the test only once, this spurious increment would be
present only in the posttest scores of the program
participants and could thus lead to erroneous conclusions
regarding treatment impact. A compounding of this effect
would almost certainly occur if pretesting were the
children's first test-taking experience. Under these
conditions, pretest scores might be artificially low.






Assuming some test-taking sophistication, a rule-of-thumb
estimate for the size of the practice effect would
be one tenth of a standard deviation if the same form of
the test were used for both pre- and posttesting (Levine
& Angoff, 1958). Use of alternate forms would
significantly reduce this effect, but is probably an undesirable
practice except in rare cases where matching of the
alternate forms is nearly perfect.
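
As a small worked illustration of the rule of thumb, the practice effect can be subtracted from an observed gain before the gain is interpreted. The function name and the figures below are hypothetical, and the one-tenth fraction applies only under the conditions the text states (same test form, no control group, pupils with some test-taking experience).

```python
def practice_adjusted_gain(observed_gain: float, test_sd: float,
                           practice_fraction: float = 0.1) -> float:
    """Subtract a rule-of-thumb practice effect from an observed gain.

    practice_fraction is the assumed practice effect as a fraction of
    one standard deviation (one tenth in the text).
    """
    return observed_gain - practice_fraction * test_sd


# Hypothetical: a 4-point gain on a test whose standard deviation is 12.
adjusted = practice_adjusted_gain(observed_gain=4.0, test_sd=12.0)
print(round(adjusted, 2))
```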






Step 4 



Question  Is there reason to believe that knowledge of group
membership may have been at least partially responsible for the
observed treatment effect?

Yes  Estimate the size of the effect, record
on the worksheet, and proceed to Step 5

No  Proceed to Step 5

Comment  Knowledge of group membership may produce the Hawthorne
effect in members of the treatment group or the "John
Henry" effect (Saretsky, 1972) in the control group.
[The Hawthorne effect is the occurrence of a performance
increment which results, not from the efficacy of a
particular treatment, but simply from an awareness that
something special is being done. See Whitehead (1938) and
Parsons (1974) for further explication. The John Henry
effect arises when those who do not receive special
treatment make an extra effort in an attempt to demonstrate
that they can do just as well without it.] There are
other spurious influences of this type which may also
confuse the issues. Children may deliberately score poorly
on a test in order to get into a special program or to
keep from graduating out of a program they enjoy. They
may also score poorly to punish a teacher or developer
they dislike.



In theory, many of these effects could be experimentally
controlled through use of a placebo treatment as is
commonly done in medical research. In practice, however,
this approach is not feasible, and the educational
researcher is left in the unenviable position of having
no experimental or statistical technique for controlling
such influences. Although they have a tendency to
dissipate with time, the researcher has no real recourse but
to rely on his own experience and judgment in deciding
whether treatment outcomes should be attributed entirely
to treatment effects or whether knowledge of group
membership increased or decreased the apparent impact.
Estimating the size of such effects, of course, can be done only
very crudely and even such judgments as "too small to
have produced the observed effect" or "large enough to
have obscured true project impact" will always be open to
question.







Step 5 

Question  Is there reason to believe that student turnover may have
been partially responsible for the observed treatment
effect?

Yes  Estimate the size of the effect, record
on the worksheet, and proceed to Step 6

No  Proceed to Step 6

Comment  Most often, educational evaluations restrict their reporting
to include only pupils for whom both pre- and posttest
scores are available. While this is the preferred method
for dealing with the problem, pupils left out of the
analysis because of incomplete data are likely to be
systematically different from those included (lower
socioeconomic status, more mobile families, higher absenteeism
rate, higher dropout rate, etc.).

Where pretest and posttest scores are reported on groups
which are not identical (i.e., some children have pretest
scores only and others have just posttest scores),
systematic biases may be present. Students who dropped out, for
example, may have been the lower scorers and thus have
contributed to a spuriously low mean pretest score and
spuriously high apparent gain. Pupils entering a project
after it begins may also be atypical and may cause posttest
scores to be either too high or low. These possible
influences can be checked by comparing pretest scores of
the pretest-only group with those of the pre-and-posttest
group and by following similar procedures with between-group
posttest score comparisons.
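
One way to carry out the comparison just described is sketched below. The report does not name a particular statistic; Welch's unequal-variance t is used here purely as an assumption, and the score lists are invented for illustration.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (t, mean difference) for two independent groups, a minus b.

    Uses Welch's unequal-variance form; this choice is an assumption,
    not a procedure prescribed by the report.
    """
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se, mean_a - mean_b


# Hypothetical pretest scores.
pretest_only = [18, 20, 21, 22, 24]               # left before the posttest
pre_and_post = [24, 26, 27, 28, 30, 31, 33, 35]   # completed both tests
t_stat, diff = welch_t(pretest_only, pre_and_post)
print(round(t_stat, 2), diff)
```

A large negative difference of this kind would indicate that the pupils who left were the lower scorers, so the pretest mean for the full group understates where the retained pupils started. The same function can be applied to posttest scores of late entrants versus pupils with both scores.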






Step 6 



Question  Does the evaluation employ a control group?

Yes  Skip to Step 14
No  Proceed to Step 7

Comment  The term "control group" is used loosely here to connote
any comparison group other than a norm group. While the
two types of groups serve identical purposes, namely to
provide an estimate of how well the treatment group would
have performed if it had not received the treatment,
normative data generally differ substantially from data
collected on control groups, and different analytic procedures
must be employed.

Evaluations based on norm-group comparisons are dealt with
in the branch of the decision tree which begins with
Step 7. Control-group designs are covered in the branch
beginning with Step 14.






Step 7 



Question  Were pretest scores used to select the treatment group?

Yes  Estimate the size of the regression
effect, record on the worksheet, and
proceed to Step 8

No  Proceed to Step 8

Comment  It is often the case that children with the greatest
educational need are selected for program participation
from a larger group of children. If this selection is
based on achievement test scores which are subsequently
treated as pretest measures, a spurious negative
correlation is produced between pretest performance and gains
from pre- to posttest. This spurious relationship arises
from the fact that scores at the low end of the
distribution reflect a preponderance of negative measurement
error while those at the high end reflect a preponderance
of positive measurement error. Immediate retesting of the
extreme groups (using an alternate form of the test) would
show the so-called regression effect whereby the mean
scores of these groups would move closer to the original
total-group mean than they were on the original test.

The magnitude of the regression effect can be
approximated by estimating the mean pretest "true" score from
the test reliability. To obtain this estimated mean true
score for a selected subgroup, the subgroup mean should be
subtracted from the total group mean and the difference
multiplied by one minus the test-retest or alternate-form
(not split-half) reliability. The estimated mean true
score is then obtained by adding the result of these
calculations to the mean score of the selected subgroup.
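
The calculation described in the preceding paragraph can be written out directly. The function names and the example values are hypothetical; the arithmetic follows the verbal description, using a test-retest or alternate-form reliability.

```python
def estimated_true_mean(subgroup_mean: float, total_mean: float,
                        reliability: float) -> float:
    """Estimated mean pretest 'true' score for a selected subgroup."""
    return subgroup_mean + (total_mean - subgroup_mean) * (1.0 - reliability)


def regression_effect(subgroup_mean: float, total_mean: float,
                      reliability: float) -> float:
    """Expected shift of the subgroup mean toward the total-group mean."""
    return estimated_true_mean(subgroup_mean, total_mean, reliability) - subgroup_mean


# Hypothetical: total-group mean 50, selected low-scoring subgroup mean 35,
# test-retest reliability .80.
true_mean = estimated_true_mean(35.0, 50.0, 0.80)
effect = regression_effect(35.0, 50.0, 0.80)
print(round(true_mean, 2), round(effect, 2))
```

In this made-up case about 3 points of any apparent pre-to-post gain would be attributable to regression alone, and the figure grows as the reliability falls.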






It is clear that the size of the regression effect is
inversely related to the reliability of the test
instrument which is used. For this reason it is important
to remember that the reliability coefficients presented
in the test publisher's manual are likely to be too high
for applications where the group tested represents a
restricted range of ability. Step 1 presents a procedure
for estimating reliabilities under such circumstances, but
it should be noted that even these estimates may be too
high and the size of the spurious regression may thus be
underestimated.






Step 8 



Question  Are normative data available for testing dates which can
be meaningfully related to the pre- and posttesting of
the program pupils?

Yes  Proceed to Step 9

No  Reject norm-group comparisons as adequate
evidence of project success

Comment  Some test publishers have collected normative data at
more than one point during the school year while others
have relied on a single data point per year. In either
case, it is common practice to publish separate norms
tables for the beginning, middle, and end of each school
year. Obviously, some of these norms are constructed
through processes of interpolation and/or extrapolation.
These constructed norms, while possibly useful for
counseling or diagnostic purposes, are likely to be in
error by amounts large enough to invalidate any inferences
drawn about cognitive growth. If they are based on
projections of more than a month or two, they should never
be used for assessing the impact of educational influences.

Where real (as opposed to constructed) norms are used,
they should be treated in the same manner as data from a
control group. While even the most naive evaluators would
recognize the folly of testing treatment and control
groups at significantly different times, test publishers'
suggestions that their norms are valid over three- or even
four-month periods are rarely questioned. Clearly, however,
the treatment group is being compared to a norm group
tested at specific times, and unless the testing times of the
two groups correspond very closely, any comparisons are
likely to be quite misleading. Ideally, the treatment
group should be tested at times exactly corresponding
to real normative data points. If this is not possible,
linear interpolations or extrapolations of a month or
even two months from the specific testing dates on which
the norms are based should not introduce large error
components. Certainly, it is better to interpolate or
extrapolate than simply to use the given norms when the
testing times differ. (See also Appendix D.)
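
A linear interpolation or extrapolation of this kind might look like the sketch below. It assumes two real normative data points for the same percentile, each given as a hypothetical (month, score) pair, and it refuses to project more than two months from the nearest real point, in line with the caution above. The names, the month scale, and the two-month default are illustrative choices only.

```python
def interpolate_norm(test_month: float, norm_a, norm_b,
                     max_projection: float = 2.0) -> float:
    """Linearly project a norm score to the actual testing month.

    norm_a and norm_b are (month, score) pairs taken from real
    (empirical) norming dates for the same percentile.
    """
    (month_a, score_a), (month_b, score_b) = norm_a, norm_b
    if month_a == month_b:
        raise ValueError("the two norming dates must differ")
    nearest_gap = min(abs(test_month - month_a), abs(test_month - month_b))
    if nearest_gap > max_projection:
        raise ValueError("testing date is too far from any real norming date")
    slope = (score_b - score_a) / (month_b - month_a)
    return score_a + slope * (test_month - month_a)


# Hypothetical: the median norm score is 40 at month 2 and 48 at month 8
# of the school year; the project tested at month 3.
expected_median = interpolate_norm(3.0, (2.0, 40.0), (8.0, 48.0))
print(round(expected_median, 2))
```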

Another possibility, where testing times were
non-comparable, would be to make explicit the comparisons which
were made. An example of this approach might be as
follows: "the mean score on the pretest (administered at
grade level 7.1) fell at the 24th percentile of the
grade 7.6 norm group while the mean score on the posttest
(administered at grade level 7.8) was at the 36th
percentile of the 8.6 norm group." While this approach may
be somewhat confusing, it is scientifically sound whereas
other commonly employed approaches (e.g., use of
constructed norms) are simply not meaningful.






Step 9 



Question  Do the norms provide a valid baseline against which to
assess the progress of the treatment group?

Yes  Proceed to Step 10

No  Reject norm-group comparisons as adequate
evidence of project success

Comment  Ideally, the norm group should be a representative sample
of the population from which the treatment group is drawn.
Thus, disadvantaged children should be compared against a
disadvantaged norm. While some work toward the
development of such norms has been accomplished, only nationally
representative norms are available for most standardized
achievement tests.



It is, unfortunately, necessary to point out that norming
practices vary widely from publisher to publisher and that
even the best norms may reflect some minor sampling
deficiencies. Normative data presented in test publishers'
manuals should never be used uncritically without
consideration of the total size and representativeness of the
norm group.



When groups of disadvantaged children are compared against
"national" norms, they are compared against a composite of
subgroups, some of which may be like them while others are
certainly not (e.g., non-disadvantaged "late bloomers").
For comparisons to be valid, these subgroups must maintain
the same relative positions with respect to one another
over time, as significant among-group changes would
indicate differential group growth rates with respect to
the overall norm. At the present time, there is no
evidence that different group growth rates occur (despite
the implication of "late blooming"). Thus, while there
are potential hazards in using nationally representative
norms to assess the progress of atypical groups, it does
not appear unreasonable to do so.

Where treatment groups are clearly special (e.g.,
non-English speaking), national norms should not be assumed
to constitute a meaningful basis for assessing progress.

One further comment should be made with respect to
normative data for grades above the elementary level.
Since dropouts come largely from the low end of the
distribution, the percentile standing of the non-dropouts
will decline. To give an extreme example, if all
children below the tenth percentile were to drop out, children
originally in the tenth percentile would immediately
become first-percentile children. This effect, even in
less extreme cases, will cause an apparent negative
growth rate among the non-dropouts. Unfortunately, it
is not possible to adjust for this phenomenon in the
absence of nationally representative empirical data on
dropouts.
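
The extreme example can be made concrete with a little arithmetic. The sketch below assumes, as the example does, that every pupil below some cut-off percentile drops out and nobody above it does; real dropout patterns are not this clean, which is why the text says no adjustment is possible without empirical data. The function name is hypothetical.

```python
def percentile_after_dropout(original_percentile: float, dropout_cut: float) -> float:
    """Percentile standing among the pupils who remain, assuming that
    everyone below dropout_cut (and no one else) has left."""
    if not 0 <= dropout_cut < 100:
        raise ValueError("dropout_cut must be in [0, 100)")
    if original_percentile < dropout_cut:
        raise ValueError("this pupil would be among the dropouts")
    return 100.0 * (original_percentile - dropout_cut) / (100.0 - dropout_cut)


# A pupil at the 10th percentile falls to the very bottom of the
# remaining group; a pupil at the 55th falls to the 50th.
bottom = percentile_after_dropout(10.0, 10.0)
middle = percentile_after_dropout(55.0, 10.0)
print(bottom, middle)
```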






Step 10 



Question  Is the comparison between the treatment group and the norm
group based on pre- and posttest scores or on gain scores?

Pre- and posttest scores  Proceed to Step 11
Gain scores  Skip to Step 12

Comment  Gain scores developed from raw scores or most derived scores
are not readily interpretable in norm-referenced evaluations
and cannot be interpreted at all in the absence of pretest
status information. The problem stems from the fact that
the no-treatment expectation in such evaluations is that
the group will maintain its percentile standing with
respect to the national norm from pre- to posttest. Where
pre- and posttest scores are available, it is simpler and
less subject to error to work with these measures directly
rather than to use gain scores.

Grade-equivalent gains appear to be an exception to this
general rule. Gains expressed as grade-equivalent months
per month of project exposure seem automatically to provide
a comparison with the average child. Not only is this
appearance erroneous, but scaling and other problems
associated with grade-equivalent gains are so severe that these
scores are more misleading than useful (see Appendix D).

Gain scores derived from "regular" standard scores (as
opposed to expanded standard scores) constitute the only
real exception to the need for pretest scores in
norm-referenced evaluations. Where such scores are provided
(e.g., for the Gates-MacGinitie) the no-treatment expected
gain is 0.0 points. Unfortunately, very few publishers
include "regular" standard scores in their test manuals.




Step 11 



Question  Have appropriate statistical tests been employed to assess
the significance of the gain in treatment group performance
relative to the norm group?

Yes  Skip to Step 22
No  Skip to Step 13

Comment  The gain of the treatment group with respect to the norm
is determined by subtracting the expected mean posttest
score from the observed mean posttest score. To find
the expected mean posttest score:

1. Determine the percentile equivalent of the mean
pretest raw or, preferably, standard, expanded
standard, or scale score.

2. Enter the norm table appropriate for the
posttest with the pretest percentile and read out
the corresponding raw, standard, expanded
standard, or scale score (the type of score
must correspond to that of the observed mean
posttest score). This score reflects the level
of performance which would have been expected
had there been no special instructional
treatment.

The statistical significance of the treatment effect can
be assessed using the following formula:

$$ t = \frac{\bar{Y} - \bar{Y}'}{\sqrt{\dfrac{s_x^2 + s_y^2 - 2\,r_{xy}\,s_x s_y}{N - 1}}} $$

where

$\bar{Y}$ = observed mean posttest score

$\bar{Y}'$ = expected mean posttest score

$s_x$ = pretest standard deviation

$s_y$ = posttest standard deviation

$r_{xy}$ = correlation between pre- and posttest
scores

$N$ = number of children

$N - 1$ = degrees of freedom

Using this formula assumes that normative data are
available for testing dates comparable to the pre- and
posttest administration times (see Step 8). It is also
essential, of course, that the norms be derived from large
and representative samples of the treatment group's
grade-level peers.
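
A minimal sketch of the significance test follows. It assumes the formula takes the usual form for correlated means, with the variance of the pre-post difference, s_x^2 + s_y^2 - 2 r s_x s_y, divided by N - 1; that form is inferred from the quantities the report defines rather than copied from it. The function name and the example figures are hypothetical.

```python
import math


def norm_referenced_t(observed_post_mean: float, expected_post_mean: float,
                      sd_pre: float, sd_post: float,
                      r_pre_post: float, n: int):
    """Return (t, degrees of freedom) for the observed-minus-expected gain."""
    if n < 2:
        raise ValueError("at least two children are required")
    variance_of_difference = (sd_pre ** 2 + sd_post ** 2
                              - 2.0 * r_pre_post * sd_pre * sd_post)
    standard_error = math.sqrt(variance_of_difference / (n - 1))
    return (observed_post_mean - expected_post_mean) / standard_error, n - 1


# Hypothetical: observed posttest mean 52, expected 48 (read from the norm
# table at the pretest percentile), both SDs 10, pre-post r = .75, N = 51.
t_value, df = norm_referenced_t(52.0, 48.0, 10.0, 10.0, 0.75, 51)
print(round(t_value, 2), df)
```

The resulting t would then be referred to a t table with N - 1 degrees of freedom. Note that the pre-post correlation must come from the project's own data.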


Some test manuals provide simplified procedures for
determining the significance of a gain from pre- to posttest.
These procedures should not be used, however, as they
incorporate assumptions about the correlation between pre-
and posttest scores which may not be applicable to the
project participants. The significance of the gain should
be determined from data in hand.






Step 12 



Question  Are pre- and/or posttest scores available?

Yes  Proceed to Step 13

No  Reject norm-group comparisons as adequate
evidence of project success

Comment  Except in those unusual instances where gain scores are
derived from "regular" standard scores (scores which have
been normalized and standardized independently at each
normative data point), it is not possible to derive gain
expectations from them. Where gain scores derived from
"regular" standard scores are available, the mean gain
score can replace the numerator of the formula given in
Step 11 and the standard error of the gain (the standard
deviation of the gain scores divided by the square root of
the number of pupils) can replace the denominator of the
same equation.

All other gain scores are uninterpretable with respect
to expectations. Unless, therefore, it is possible to
retrieve pre- and posttest scores, norm-group comparisons
cannot provide adequate evidence regarding project success.






Step 13 



Question  Can appropriate statistical tests be employed to assess
the significance of the gain in treatment group
performance relative to the norm group?

Yes  Compute appropriate statistics and
skip to Step 22

No  Reject norm-group comparisons as adequate
evidence of project success

Comment  If the mean pretest and posttest scores and the associated
standard deviations are available, the statistical
significance of the treatment effect can be assessed using the
formula given in Step 11. If these values are not
available and cannot be computed from raw data, norm-group
comparisons cannot provide adequate evidence regarding
project success.






Step 14 



Question  Were the children, either matched or unmatched, randomly
assigned to the treatment and comparison groups?

Yes  Skip to Step 18
No  Proceed to Step 15

Comment  A "yes" answer to this question implies that, prior to the
beginning of the project, a pool of eligible children
existed and each child had an equal chance of being
assigned to the treatment group. It further implies that
assignment was made on a purely chance basis without any
knowledge or consideration of the characteristics of the
pupils (except, of course, where matching was done prior
to assignment).

If a matching procedure was employed, it should have been
implemented as follows. The entire pool of eligible
children should have been organized into carefully matched
pairs on the basis of pretest scores and other potentially
relevant variables (e.g., sex). One member of each pair
should then have been selected at random for assignment
to the treatment group. The remaining member of the pair
would then, of course, have been assigned to the comparison
group.
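
The matching-then-randomizing sequence might be sketched as below. For brevity the sketch matches on pretest score alone by pairing adjacent pupils in rank order; the report also calls for other relevant variables such as sex, which a real procedure would need to handle. The function name, the pupil labels, and the scores are hypothetical.

```python
import random


def matched_pair_assignment(pupils, seed=None):
    """Pair pupils on pretest score, then assign one of each pair at random.

    pupils is a list of (pupil_id, pretest_score) tuples. Returns two
    lists of pupil ids: (treatment, comparison).
    """
    rng = random.Random(seed)
    ranked = sorted(pupils, key=lambda p: p[1])
    treatment, comparison = [], []
    # With an odd-sized pool the highest-ranked pupil is left unassigned.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # chance alone decides who receives the treatment
        treatment.append(pair[0][0])
        comparison.append(pair[1][0])
    return treatment, comparison


pool = [("A", 31), ("B", 45), ("C", 33), ("D", 52), ("E", 44), ("F", 50)]
treatment, comparison = matched_pair_assignment(pool, seed=1)
print(treatment, comparison)
```

The essential point is the order of operations: pairs are formed from the whole eligible pool first, and only then does a random draw split each pair.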

Note: Matching after assignment to treatment and
comparison groups is a fundamentally unsound practice. (See
Step 15.)






Step 15 



Question  Is there evidence that members of the treatment and
control groups belong to the same population or to
populations that are similar on all educationally relevant
variables including pretest scores?

Yes  Proceed to Step 16
No  See Appendix C

Comment  Random assignment will usually (but not always) produce
groups which are comparable. On the other hand, groups
resulting from non-random processes are likely to differ
from one another on educationally relevant dimensions.
If such differences exist, there is no entirely
satisfactory means of making between-group comparisons.

As Lord (1967) has pointed out, "If the individuals are
not assigned to the treatments at random, then it is not
too helpful to demonstrate statistically that the groups
after treatment show more difference than would have been
expected from random assignment -- unless, of course, the
experimenter has special information showing that the
nonrandom assignment was nevertheless random in effect
[p. 38]." The same could be said where significant pretest
differences were found between groups which were developed
through random processes.

Where pre-existing, intact groups are used as the
treatment and control groups, it is not appropriate to assume
that they are, even in effect, random samples from a
single population. The probability that they may be
must be investigated empirically. At the very least,
the two groups must not be significantly different in
terms of pretest scores. They should also be comparable
in terms of socioeconomic status, age, sex, and racial
and ethnic composition. School size and setting
(urban-rural) as well as neighborhood should also be comparable.
Even with these factors equated, serious selection biases
are common. Such biases are introduced when teacher or
student participation is voluntary or when experimental
groups are selected by principals or teachers.

A common design error where comparable, intact groups
cannot be found is that of matching members of the
treatment group with specific members of other, non-comparable
groups. The assumption here is that a comparable control
group can be constructed through the matching process.
The fallacy inherent in this assumption is that the
selected subgroup is atypical of the group from which it
is drawn and will show a regression toward the mean of
that group on posttest measures. Campbell and Stanley
(1963) describe this type of post-hoc matching as "a
stubborn, misleading tradition in educational
experimentation," and as a "hazard" which is "frequently tripped
over [p. 219]."






Step 16 



Question  Are post-treatment comparisons made in terms of posttest
or gain scores?

Posttest scores  Skip to Step 19
Gain scores  Proceed to Step 17

Comment  Two types of gain score are frequently used in educational
evaluations: raw and residual gain scores. Raw gain
scores are derived by subtracting pretest scores from
posttest scores. When raw gain scores are used, the size
of the treatment effect is defined as the treatment
group's raw gain score minus that of the control group.
It can be shown that this difference is mathematically
identical to the treatment group's posttest score minus
the control group's posttest score after the latter has
been adjusted by the entire amount of the difference
between the two groups' pretest scores. Compared to
covariance analysis, which the authors hold to be the most
appropriate method to compensate for initial differences
between groups, the raw gain score adjustment is
excessive and results in an overestimation of the treatment
effect when the treatment group's pretest score is lower
than that of the control group. Conversely, raw gain scores
underestimate the size of the treatment effect when the
treatment group scores higher on the pretest than the
control group.
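
The identity mentioned above, and the contrast with a covariance-style adjustment, can be checked numerically. In the sketch below the raw gain comparison is the special case in which the control group's posttest mean is adjusted by the full pretest difference (slope 1.0), while a covariance analysis would use the regression slope of posttest on pretest, which is normally below 1.0. The group means and the slope of 0.8 are hypothetical, and the functions work on means only, not on pupil-level data.

```python
def raw_gain_effect(pre_t: float, post_t: float, pre_c: float, post_c: float) -> float:
    """Treatment group's mean raw gain minus the control group's."""
    return (post_t - pre_t) - (post_c - pre_c)


def adjusted_posttest_effect(pre_t: float, post_t: float,
                             pre_c: float, post_c: float, slope: float) -> float:
    """Posttest difference after moving the control mean by
    slope times the pretest difference between groups."""
    return post_t - (post_c + slope * (pre_t - pre_c))


# Hypothetical means: the treatment group starts 5 points lower.
pre_t, post_t, pre_c, post_c = 40.0, 50.0, 45.0, 52.0
raw = raw_gain_effect(pre_t, post_t, pre_c, post_c)
full = adjusted_posttest_effect(pre_t, post_t, pre_c, post_c, slope=1.0)
partial = adjusted_posttest_effect(pre_t, post_t, pre_c, post_c, slope=0.8)
print(raw, full, partial)
```

With an initially lower treatment group, the raw gain figure equals the full-adjustment figure and exceeds the partial-adjustment figure, which is the overestimation the text describes.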

Residual gain scores are not really gain scores at all.
They are differences between observed posttest scores
and posttest scores predicted from the regression of
posttest on pretest scores for the combined treatment
and control groups. If the treatment has been effective,
observed performance of the treatment group on the
posttest will exceed the prediction, whereas the performance
of the control group will fall below the predicted value.
The sum of the absolute values of the two deviations is
presumed to yield a measure of the treatment effect.
Where there is no difference between groups on the
pretest, covariance analysis, raw gain scores, residual gain
scores, and even simple posttest comparisons will all
yield exactly the same measure of the treatment effect.
As pretest differences are introduced, however, the
measure of treatment effect obtained from residual gain
scores systematically diminishes and approaches zero where
initial between-group differences are large. Where any
pretest differences exist, a residual gain analysis will
always underestimate the size of the treatment effect.



Wherever possible, covariance analysis, preferably with
an adjustment for test unreliability (e.g., Porter, 1967),
should be used to compensate for initial differences
between treatment and control groups, assuming, of course,
that the two groups can be regarded as random samples
from a single population. Statistically significant
treatment effects found with either residual gain scores or
raw gain scores when the treatment group is initially
inferior to the control group constitute adequate evidence
of project success. The real danger inherent in these
approaches lies in the rather high probability of rejecting
projects which are really effective.






Step 17 



Question Can data be obtained which would enable application of
covariance analysis techniques, would such analyses be
appropriate, and is there a reasonable expectation that
they would produce significant results?

Yes Conduct covariance analysis and
proceed to Step 22

Ko Skip to Step 2^* 

Comment Wherever pretest differences between treatment and control
groups have resulted from random assignment procedures,
covariance analysis may be employed to adjust for these
differences. Where the treatment group was superior
on the pretest, this type of analysis will significantly
reduce the probability of incorrectly inferring a treatment
was successful when it was not. Conversely, where
the treatment group was initially inferior, covariance
analysis will significantly reduce the probability of
rejecting a successful treatment as unsuccessful. In
both instances the covariance adjustment will increase
the accuracy of posttest measures so that the true magnitude
of program impact can be determined.

There is, of course, no justification for the extra computational
labor required for covariance analysis if the
two groups obtained equal scores on the pretest. Further,
covariance analysis is not required where an initially
inferior treatment group scored significantly higher than
the control group on the posttest if interest is restricted
to the statistical significance of the treatment effect
rather than an estimate of its size.




Step 18 



Question Were pretest scores collected?

Yes Go back to Ste?^ IS 
No Proceed to Step 20

Comment If assignment of pupils to treatment and control groups
has been truly random, it is not essential to collect
pretest scores since valid inferences can be drawn from
posttest score comparisons. If pretest scores are collected,
however, more powerful statistical tests can be
employed in cases where the assignment process has
resulted in small initial differences between the groups.






Step 19 



Question Have covariance analysis techniques been employed to adjust
for initial differences between groups?

Yes Skip to Step 22
No Go back to Step 17

Comment Where assignment to either the treatment or the control
group has been random or "random in effect" (see Step 15),
small pretest score differences may be found between
groups. Under these circumstances, analysis of covariance
is the most appropriate statistical technique available
for testing treatment effects. If the analysis has
been done correctly, its findings may be accepted at face
value.

Covariance analysis must never be regarded as an adequate
technique for statistically equating dissimilar groups.
It can only be used where its assumptions (effectively
random assignment and homogeneity of regression) are met
and where initial differences between groups are not
excessive. It should be noted that even where regression
is statistically non-heterogeneous, small differences
in regression line slopes introduce errors into the
computations. These errors interact in a multiplicative
fashion with the size of the between-group difference.
A small error multiplied by a big difference becomes a
big error. For this reason, it is common to use the 10%
level for rejecting the hypothesis of homogeneous variance.
Use of the 20% level would be appropriate when
the difference between group means is large.
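
The multiplicative interaction can be made explicit with a short worked equation. It assumes that the covariance-adjusted difference takes the usual form shown on the first line; the symbols $b$ and $\delta$ and the illustrative numbers are not taken from the report.

```latex
% Covariance-adjusted difference between treatment (t) and control (c) means,
% where b is the regression slope of posttest (Y) on pretest (X)
D_b = (\bar{Y}_t - \bar{Y}_c) - b\,(\bar{X}_t - \bar{X}_c)

% If the slope used is in error by \delta, the adjusted difference is in error by
D_{b+\delta} - D_b = -\,\delta\,(\bar{X}_t - \bar{X}_c)

% Illustration with \delta = 0.05:
%   pretest mean difference of  2 points  ->  error of 0.1 point
%   pretest mean difference of 20 points  ->  error of 1.0 point
```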






Step 20 



Question Have appropriate statistical tests been employed to
compare posttest or gain scores?

Yes Skip to Step 22
No Proceed to Step 21

Comment A wide variety of statistical tests and procedures can
be used for testing differences between groups. Raw or
(preferably) standard score comparisons may often be made
on either posttest or gain scores using parametric statistical
tests such as Student's t for independent means
(t for correlated scores where pupils were matched prior
to assignment to groups) or analysis of variance. However,
the data should be inspected to confirm that the
assumptions of these tests have been met, since score
distributions from special instructional projects may not
meet requirements such as normality due to test ceiling
or floor effects or other confounding influences.

Where parametric test assumptions are not met, non-parametric
tests such as the Mann-Whitney U or the Kolmogorov-Smirnov
test are appropriate but are less powerful than their parametric
equivalents. Non-parametric tests must also be used
where comparisons are made between posttest grade-equivalent
scores (assuming random assignment). There is no meaningful
way in which grade-equivalent gains can be compared.

The cautions regarding the drawing of inferences from
gain-score comparisons discussed in Step 16 should be
carefully observed.
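
As a minimal sketch of two of the tests named above, the following computes the pooled-variance Student's t statistic for independent means and the Mann-Whitney U statistic for the same pair of groups. The scores are invented, the function names are hypothetical, and the sketch stops at the test statistics; significance would still have to be read from the appropriate tables.

```python
from math import sqrt
from statistics import mean, variance


def students_t(a, b):
    """Pooled-variance t statistic for two independent groups."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def mann_whitney_u(a, b):
    """U statistic for group a: number of (a, b) pairs in which the
    a score is higher, with tied pairs counted as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Invented posttest standard scores for two small groups.
treatment = [52, 55, 58, 61, 64]
control = [48, 50, 53, 56, 58]

t_value = students_t(treatment, control)      # degrees of freedom: 5 + 5 - 2
u_value = mann_whitney_u(treatment, control)  # ranges from 0 to 5 * 5

print(round(t_value, 3), u_value)
```

The U statistic depends only on the ordering of scores, which is why it remains usable when the distributional assumptions behind t are in doubt.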


Step 21



Question Can data be obtained which would enable appropriate tests
to be made?

Yes Obtain data, compute appropriate
statistics, and proceed to Step 22

No Reject posttest and/or gain score
comparisons as adequate evidence of
project success

Comment Where inappropriate statistical approaches have been
adopted, there is no choice but to seek out the information
needed to conduct appropriate tests. If raw or (preferably)
standard score summary statistics (means and standard deviations)
are available, t tests could be done. In many cases,
unfortunately, all calculations will have been done inappropriately
(e.g., by using grade-equivalent scores) and
it will be necessary to go back to individual test scores
if meaningful analyses are to be done. If this procedure
is followed, raw or grade-equivalent scores should be converted
to their standard-score equivalents before any
arithmetic operations are performed on them. Appropriate
tests are discussed in Steps 17 and 20.
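
Where only summary statistics survive in a project report, a t test for independent means can still be computed from them. The sketch below assumes that the reported standard deviations are sample values (n - 1 denominators) and that a pooled-variance test is appropriate; the function name and the figures are hypothetical.

```python
from math import sqrt


def t_from_summary(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Pooled-variance t statistic computed from group means,
    standard deviations, and group sizes."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    standard_error = sqrt(pooled_var * (1 / n_t + 1 / n_c))
    return (mean_t - mean_c) / standard_error


# Hypothetical standard-score summaries: 50 pupils per group.
t_value = t_from_summary(mean_t=52.0, sd_t=10.0, n_t=50,
                         mean_c=48.0, sd_c=10.0, n_c=50)
degrees_of_freedom = 50 + 50 - 2

print(round(t_value, 3), degrees_of_freedom)
```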






Step 22 



Question Do analysis results favor the treatment group at the
preselected level of statistical significance?

Yes Review all evidence compiled during the
validation process and use judgment to
decide whether the statistical test results
can reasonably be attributed to
project effects

No Reject evidence as being inadequate to
validate project success

Comment Given a statistically significant result, the attribution
of cause is still at issue. The final step in relating
an observed effect to the treatment requires careful
consideration of each of the extraneous effects identified
in proceeding through the decision tree and estimation of
their contribution, in aggregate, to the apparent impact
of the treatment. It is, finally, left to the judgment
of the evaluator to assess the magnitudes of these effects,
weigh their influence in the evaluation results, and conclude
whether or not the treatment was effective.






[Figure 1. Decision tree for validating statistical significance.]



The decision tree presented in the preceding section of this
report should enable reasonably unequivocal conclusions to be reached
regarding the existence or nonexistence of some treatment impact.
Difficult as that decision-making process may be, even more difficult
questions arise in assessing the practical value of the observed
impact. Relevant questions include, "What is the educational
significance of a third-of-a-standard-deviation (or any other size) gain
on a standardized reading achievement test?", "What is the significance
of a five-point gain in reading comprehension as opposed to a comparable
gain in vocabulary?", and "Is a moderate-cost treatment which produces
moderate gains more educationally significant than a costly treatment
which produces larger gains?"

Consideration of these and related questions quickly brings to
light the difficulty of making even gross-level decisions in the
absence of a metric for quantifying educational significance. And many
would argue that scores on standardized achievement tests in no way
satisfy the requirements for such a metric. Unfortunately, the lack of
a presumably adequate metric for educational significance does not
relieve decision-makers of their responsibility to choose among and
act upon the alternatives available to them. Neither does the lack
of an adequate metric imply that all measurement is infeasible or that
decisions must be made without useful guidance from educational research.
Standardized test scores do constitute meaningful indices and, if
appropriately interpreted, go a long way toward achieving their ultimate
objective.

Basic to the entire quantification issue is the sometimes overlooked
fact that educational significance is an inherently subjective concept.
While scales may be constructed from the consensus of experts, it must
be acknowledged that they will be culture-bound and situation-specific.
Furthermore, there will be educators of substantial stature who will
disagree with any set of consensus-based priorities and relationships.

A simple illustration can be drawn from standardized reading
achievement tests where it is common practice to provide separate
scales for vocabulary, comprehension, and occasionally other component
skills. Clearly these subtests could be weighted and combined in a
number of different ways to yield a "Total Reading" score. Some
educators might argue that vocabulary and comprehension are equally
important aspects of reading while others might claim that comprehension
was twice, or five times, or even ten times as important as vocabulary.
It is clear that this issue cannot be adequately resolved through
empirical research and can only be dealt with by "majority rule" or some
similar, equally unsatisfactory expedient.

Despite the fervor with which this issue may be debated, the
method of combining vocabulary and comprehension subtest scores to
obtain a total reading score appears, upon closer examination, to be
little more than a pseudo-problem. The two subtests are so highly
intercorrelated (typically, r = .80) that even very different weighting
systems have almost no impact on the ordering of total scores. In other
words, students will fall into very nearly the same order whether
comprehension scores are given ten times the weight of vocabulary scores or
the two scales are equally weighted. Although the empirical evidence
may be less complete, it appears that many widely debated issues in
educational evaluation today can be deflated with the same sort of
demonstration. Clearly, the argument that standardized achievement
tests ought not to be used for assessing cognitive growth can be quickly
invalidated if the correlations between test scores and other measures
purported to reflect component skills more adequately are shown to be
high.
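
The weighting argument can be checked with a short calculation. The sketch below assumes that both subtests are expressed as standard scores with equal variance, and it uses the Pearson correlation between the two composite scores as a stand-in for agreement in ordering; the function name is hypothetical.

```python
from math import sqrt


def composite_correlation(r, weights_a, weights_b):
    """Correlation between two weighted sums of the same two unit-variance
    subtests, where r is the correlation between the subtests."""
    (a1, a2), (b1, b2) = weights_a, weights_b
    covariance = a1 * b1 + a2 * b2 + r * (a1 * b2 + a2 * b1)
    var_a = a1 ** 2 + a2 ** 2 + 2 * r * a1 * a2
    var_b = b1 ** 2 + b2 ** 2 + 2 * r * b1 * b2
    return covariance / sqrt(var_a * var_b)


# Weights are (vocabulary, comprehension).
equal_weights = (1, 1)
tenfold_comprehension = (1, 10)

with_r_80 = composite_correlation(0.80, equal_weights, tenfold_comprehension)
with_r_00 = composite_correlation(0.00, equal_weights, tenfold_comprehension)

print(round(with_r_80, 3), round(with_r_00, 3))
```

With subtests correlated .80, the two total scores correlate above .96; were the subtests uncorrelated, the same change in weights would matter far more.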

The conclusion, then, must be that standardized tests, with all
their deficiencies, do provide a useful metric for assessing the basic
skills of reading and math. Standard scores on such tests, although
not comprising ratio scales, do provide a means of quantifying gains,
of relating observed gains to expectations in a reasonable manner,
and of measuring the impact of special instructional projects on cognitive
growth. At the same time, it is clear that they do not provide a
complete answer to the kinds of questions raised in the first paragraph
of this section. The difficulty in coming to grips with these
questions lies not in determining the size of the gains but in
determining their value.

The value issue was alluded to above in discussing the relative
value of gains in vocabulary as opposed to comprehension. In this
situation, at least, the issue was shown to be a pseudo-problem and
it was implied that many similar issues might be of far greater
theoretical than practical concern. The absolute value of achievement gains
may also pale into relative insignificance when examined in the context
of real-world contingencies. An achievement gain of "X" standard-score
points is likely to be worth exactly the amount of money a school
district is able or willing to spend to obtain it, and this, in turn,
will depend on the needs of the children in the district and perceptions
of the relative priorities existing among them. If needs can be
adequately defined, relative comparisons among the alternatives available
to fit them are sufficient. Absolute scales of educational significance
may be required for the typical kind of cost-benefit studies seen in
the harder science and engineering areas, but educational issues need
not be defined in that manner.

In their search for effective compensatory education projects to
package, the authors decided they would consider any treatment which
produced one-third of a standard deviation gain with respect to the
national norm. Above that point, choices would be based on judgments
reflecting the size of gains, costs, replicability, availability, target
group served, variety of approach, etc. Their original guess that the
choices would be relatively easy to make and unequivocal was substantiated.
While this example may be atypical, it seems that the alternatives
available to fill a specific need will rarely be so numerous as to preclude
sound decision-making by qualified, well-informed, and thoughtful judges.







APPENDIX A

PROJECT SELECTION CRITERIA WORKSHEET I

PROJECT TITLE

Approach

Pull-out vs. Whole class

PREREQUISITES

□ Provides instruction in reading and/or math

□ Serves children in grades K-12

□ Serves educationally disadvantaged children

□ Has achievement test data for more than one "instance"

FINAL ASSESSMENT

□ Accepted

□ Rejected

Reason for rejection

□ Prerequisites not met

□ Inadequate evidence of effectiveness

□ Excessive costs

□ Not available

□ Not replicable






PROJECT SELECTION CRITERIA WORKSHEET II

AVAILABILITY

Accessibility:
□ Can be visited for validation
□ Personnel are cooperative
□ Procedures, results, and costs are documented

Acceptability:
□ Operational in public schools
□ Not primarily a single commercial product

COST

□ Equipment plus special personnel less than $____ per pupil

□ Initial investment less than $____ per pupil

□ (Alternatively) Per-pupil cost over a three year operational
period including start-up costs should not exceed $____ per year

REPLICABILITY

□ All major components can clearly be duplicated. Components
include: materials, hardware, personnel, and environments.

EFFECTIVENESS

□ Achievement test data show consistently that actual
post-treatment performance exceeds the no-treatment expectation
by an amount which is statistically significant and equal
to at least ____ standard deviation with respect to
the national norm.



PROJECT SELECTION CRITERIA WORKSHEET III
ANALYSIS OF PROJECT EVALUATION

Complete a separate sheet for each validating site or combination of
sites for which separate data are reported.

PROJECT TITLE

Tryout Group

I. Tryout Summary

A. Treatment group description

1. Number

2. Grades/Ages

3. SES/Ethnic

4. Pre-project achievement level

5. Schools/Classrooms

6. Selection procedure

7. Treatment period dates

Hours per week

B. Comparison group description (if same as experimental group
write "same")

1. Number

2. Grades/Ages

3. SES/Ethnic

4. Pre-project achievement level

5. Schools/Classrooms

6. Selection procedure

7. Treatment period dates

Hours per week






PROJECT SELECTION CRITERIA WORKSHEET III (Continued)
ANALYSIS OF PROJECT EVALUATION

II. Evaluation Model Employed

□ Norm-referenced

□ Control group

□ Regression

□ Other (specify)

III. Confounding Influences (comment on items checked)

□ Inadequate tests

□ Ceiling/Floor effects

□ Pretest effect

□ Group membership effect

□ Student turnover effect

□ Inappropriate testing times

□ Inappropriate comparison group

□ Participant selection via pretest

IV. Evaluation Outcomes

A. Evidence of Statistical Significance

B. Size of Gain with Respect to the National Norm






APPENDIX B

Norm-referenced versus Criterion-referenced Tests



While use of criterion-referenced tests has been advocated for at
least ten years (Glaser & Klaus, 1962), educational projects are still
evaluated predominantly in terms of commercial, norm-referenced tests.
The reluctance of educators to abandon familiar testing paradigms is
understandable in view of the continuing confusion over the exact
distinction between the conventional norm-referenced test and the new
criterion-referenced instruments. This confusion is clearly evident in
recent articles by Airasian and Madaus (1972), Jackson (1971), and
Popham and Husek (1971), and in a review by Davis (1973) of eight 1972
AERA papers on criterion-referenced testing.

The confusion appears to result from conceptualizing criterion-referenced
tests as an alternative to norm-referenced tests. In fact,
norm- and criterion-referenced tests do not represent mutually exclusive
test categories nor do they represent the ends of a continuum. On the
contrary, the "norm" and "criterion" descriptors refer to completely
independent test characteristics, both of which should probably be
included in the description of any test. The problem is further
complicated by the fact that, although there are real differences between
tests that are labeled "norm-referenced" and those labeled
"criterion-referenced," these labels do not capture the salient distinguishing
features.

The dominant characteristic of tests that are labeled
"criterion-referenced" is that their content is clearly defined in terms of some
performance dimension of interest. This relationship permits direct
interpretation of individual scores in ways which have immediate
practical implications (e.g., time required to run a mile, or proportion
of the 3000 most common English words that the individual can define).
The misleading label apparently derives from the failure to distinguish
between the dimension being measured and the scale adopted to measure
it. This failure is not surprising in the context of training program
development which first popularized "criterion-referenced" testing.
For example, Glaser and Klaus (1962) wrote:

Two kinds of criterion standards are available for evaluating
individual proficiency. First, a standard can be established
which reflects the minimum level of performance which permits
operation of the system. At the other extreme, proficiency
can be defined in terms of maximum system output. The standard
of measurement is then expressed as a function of the
capabilities of other components in the system. The man loading
a Navy gun, for example, never needs to load more rapidly than
he receives shells from the magazine below decks. In this case,
a fairly absolute standard of proficiency is available. fp« 

In this and similar situations, it has become popular to say that
a performance criterion has been established and the test used in
measuring performance need only tell us whether or not the criterion is
reached. It might be more informative to say that the test measures a
performance dimension (speed of loading), that system requirements
dictate a specific cutoff score, and that in the interest of economy it
would be adequate to dichotomize the speed of loading scale about this
cutoff. Everyone below the cutoff would get a score of "too slow."
Everyone above the cutoff would get a score of "fast enough."

The term "norm-referenced" has rivaled "criterion-referenced" in
terms of confusion generated. Any test becomes a norm-referenced test
as soon as a norm group of one or more entities is defined and scores
of those entities are obtained. Of course, if the norm reference is to
be of any use there are many properties that the test and the norm group
must have. The required properties depend entirely on the intended use
of the test, but one typically desires relevance and proper sampling for
norm groups, while tests should provide reliable and efficient
quantification.

The relative independence of norm referencing and performance
referencing can be illustrated by an instrument used to select students
for pilot training. Successful tests for this purpose can and have been
developed using what are usually referred to as conventional
norm-referenced test development procedures. It should be clear from the
above discussion, however, that norm reference is not the salient
characteristic of such tests. While validation groups must be used
to develop and scale the tests, the ultimate criterion is flying
success, and is not dependent on standings in relation to any norm
group. Once a reliable test has been developed which correlates
highly with a measure of pilot success, a single cutoff score, or
criterion, could be determined, and applicants could be scored either
pass or fail.

At the same time, neither the procedures for developing the test
nor the final appearance of the test would classify it as
"criterion-referenced." That is, it is unlikely that the population of pilot skills
would be sampled at all. Of course, one could say that the final
instrument defined something called "pilot aptitude" but it is doubtful
whether the concept could be identified from the test items or that
one would feel enlightened to know that a person who scores "X" or
more points on this aptitude could be taught to fly. An "aptitude"
as measured by correlated items is simply not what we usually mean by
a performance dimension. In short, this most familiar type of test is
neither particularly "norm-referenced" nor particularly
"criterion-referenced."

It should be noted that the concepts discussed above are not new
and have been recognized by various authors (e.g., Glaser & Nitko, 1971;
Davis, 1972). Even these authors, however, preserve the norm/criterion-reference
categories. Regardless of the terminology which is ultimately
adopted, it must be recognized that new and useful measurement
techniques have been introduced in the process of attempting to define and
develop criterion-referenced tests. It should be emphasized that it is
the categorization that is unproductive, and not necessarily the
techniques which have been developed.






Implications for Project Evaluation



In contrast to the pilot-trainee selection test which was neither
norm- nor "performance"-referenced, the commercial reading and math
achievement tests used in project evaluation are both norm referenced
and performance referenced. The norm group properties need little
comment except to point out that norm groups are typically presented
as nationally representative (although some are clearly more
representative than others) and may not be suitable for assessing the gains of
particular subgroups.

The performance dimension that is defined by standardized tests is
somewhat arbitrary, and it may well be argued that substantial
improvement is needed here. Raw scores are seldom reported in a meaningful
way and items are probably chosen on the basis of discrimination rather
than as a sample of a carefully defined performance domain. The
problems are almost certainly worse in testing reading than in testing math
but they reflect the basic difficulty in defining what is meant by
reading skill and measuring it.

While commercial standardized tests are clearly not optimal
instruments for research purposes, there is little empirical evidence to
suggest that tests developed according to criterion-referenced
procedures provide better measures of project effectiveness in basic skill
areas. While, in theory, criterion-referenced instruments which are
focused on the specific objectives of a particular instructional
treatment ought to be more sensitive to achievement gains resulting from it
than the more general standardized tests, the latter clearly sample
important aspects of reading and math achievement and are relatively
efficient and reliable instruments. Clearly, criterion-referenced
or other special-purpose tests are perfectly acceptable for use in
assessing the statistical significance of project impact. If enough
is known about their properties, it should also be possible to estimate
the educational significance of observed gains. One requirement, of
course, is that both the statistical and educational significance of
pre-to-posttest gains must be assessed against the gains which would be
expected under no-treatment conditions. In the absence of normative
data, the estimation of no-treatment posttest status clearly necessitates
the use of a comparison group evaluation model.






Estimation of Treatment Effects from the Performance
of Non-comparable Control Groups

Where treatment and control groups are significantly different from
one another, it is generally not possible to assess the impact of an
educational intervention. In the case where a treatment group scores
lower on the pretest and higher on the posttest than an otherwise
comparable control group, it is probably safe to conclude that the treatment
was effective but, even here, the magnitude of the treatment effect
cannot be accurately estimated.

There are some evaluation designs which employ a non-comparable
control group to generate an estimate of how the treatment group would
have performed on the posttest had they not participated in the treatment.
The most widely applicable and plausible of these designs require that
an original group be dichotomized about some pretest cutoff score so
that all pupils scoring on one side of the cutoff score receive the
treatment while none of those scoring on the other side are allowed to
participate. Two such designs are presented here along with one design
which does not require such dichotomization. The designs are:

A. The Regression-discontinuity Model

B. The Regression Projection Model

C. The Generalized Multiple-regression Model

A. The Regression-discontinuity Model

The model which appears most immune to plausible alternative
hypotheses is the Regression-discontinuity Model (Campbell & Stanley, 1963).
A comprehensive development of this model and related statistical tests
is available (Sween, 1971). The model requires that treatment and
comparison groups be developed from a single original group by assigning
all members on one side of a pretest cutoff score to the treatment group
and all members on the other side to the comparison group. Separate
pretest-posttest regression lines are then computed for each group and
the difference between the lines is tested at the point where they
intersect the pretest cutoff value.
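
The core computation of this model can be sketched as follows: fit one line per group and compare the two fitted values at the cutoff. The data are invented and noise-free, the function names are hypothetical, and the sketch stops at the size of the gap; it does not reproduce the significance tests developed by Sween (1971).

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def discontinuity_at_cutoff(pre_t, post_t, pre_c, post_c, cutoff):
    """Treatment-line value minus comparison-line value at the cutoff."""
    slope_t, intercept_t = fit_line(pre_t, post_t)
    slope_c, intercept_c = fit_line(pre_c, post_c)
    return (intercept_t + slope_t * cutoff) - (intercept_c + slope_c * cutoff)


CUTOFF = 50  # pupils scoring below 50 on the pretest receive the treatment

# Invented scores: comparison group follows posttest = 10 + 0.8 * pretest,
# treatment group follows the same line shifted upward by 6 points.
pre_t, post_t = [30, 35, 40, 45], [40, 44, 48, 52]
pre_c, post_c = [55, 60, 65, 70], [54, 58, 62, 66]

gap = discontinuity_at_cutoff(pre_t, post_t, pre_c, post_c, CUTOFF)
print(round(gap, 6))
```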

The model is rigorous in the sense that, if the procedures are
followed correctly, rejection of the null hypothesis for any reason other
than a treatment effect is extremely implausible. There are two
considerations, however, which severely restrict the applicability of the
model. First, it is difficult in a school environment to enforce
assignment to treatment groups solely on the basis of test scores, or even on
the basis of scores reflecting both test performance and a numerical
teacher rating. Second, the model is not sensitive to changes in
regression line slopes unless these changes are accompanied by a
discontinuity of the regression lines. This requirement represents a potential
problem since compensatory education projects are often individualized
on the basis of student need. Such individualization could produce the
greatest improvement in those students farthest below the pretest cutoff
score thereby flattening the treatment-group regression line without
producing a discontinuity at the cutoff point. At least one compensatory
reading project known to the authors appears to produce this kind of
effect.

In short, regression-discontinuity analysis is recommended for all
cases in which the conditions for its implementation are met and a
positive result can be anticipated. It seems unlikely, however, that such
cases will occur frequently.

B. The Regression Projection Model

The Regression Projection Model uses a regression line calculated
from the comparison-group pretest-posttest distribution to estimate what
the treatment-group posttest scores would have been under a "no treatment"
condition. Like the Regression-discontinuity Model, it also requires
dichotomization of a total group into treatment and comparison subgroups
about a particular pretest cutoff score. The advantage of this model
is its sensitivity to treatment-produced changes in regression line
slopes. Its primary weakness is its inability to distinguish treatment
effects from other factors which may affect the regression line.

The model is analogous to the technique of Karl Pearson for
estimating total-group test validity when criterion measures are available
only for those who score above some selected cutoff point. It is
applicable where selection (pretest) scores are available for an entire group,
but where there is no indication of how the subgroup below the cutoff
score would have done on the posttest had they been treated in the same
manner as the group above the cutoff.

The basic assumption of the model is that under no-treatment
conditions the regression of posttest scores on pretest scores for the total
group would be homogeneous and linear throughout the entire score range.
The regression line for the comparison group is taken as the estimate
of this total group regression line, and is projected through the
treatment-group distribution (see Figure 2). This projected regression line
is then used to calculate the estimated no-treatment posttest score.

The model should be applied with caution since the basic assumption
of homogeneous, linear regression may not be tenable. For example, in
compensatory projects, factors which lower the pretest-posttest
correlation for low-scoring students may invalidate the model completely. Floor
effects on the pretest and other factors leading to low pretest reliability
at the lower end of the range are particularly troublesome. At a minimum,
a good argument that such factors are not acting is required. A scatter
diagram permitting inspection of the pretest-posttest distribution for
irregularities is essential.

Horst (1966), Chapter 26, provides a discussion of the underlying
statistical issues and presents formulas for generating unbiased estimates
of the mean, standard deviation, and pretest-posttest correlation for
the total group. The estimated regression equation for the total group
is identical to the regression equation for the restricted (comparison)
group. Thus, one needs only to calculate the regression equation for
the comparison group and use it to obtain estimated treatment-group
posttest scores. This equation can be written:






$$\hat{Y}_t = b_c X_t + k_c$$

where $b_c$ is the slope of the comparison-group regression line and $k_c$
is its Y-axis intercept.

If the mean pretest score of the treatment group ($\bar{X}_t$) is
substituted for $X_t$ in the above equation, $\hat{Y}_t$ will be the
estimated mean posttest score. The difference between the actual and
estimated mean posttest scores ($\bar{Y}_t - \hat{Y}_t$) can then be tested
using



$$t = \frac{P_t(\bar{Y}_t - \hat{Y}_t)\sqrt{N - 3}}{\sqrt{\bar{s}_Y^2 + b_c^2\bar{s}_X^2 - 2 b_c b\,\bar{s}_X^2 + P_t P_c(\bar{Y}_t - \hat{Y}_t)^2}}$$

where

  $P_t$ = proportion of pupils in the treatment group

  $P_c$ = proportion of pupils in the comparison group

  $N$ = number of pupils in the combined group

  $\bar{s}_Y^2$ = weighted mean of the treatment- and comparison-group
  posttest variances

  $\bar{s}_X^2$ = weighted mean of the treatment- and comparison-group
  pretest variances

  $b_c$ = slope of the comparison-group regression line

  $b$ = weighted mean of the slopes of the treatment- and
  comparison-group regression lines

The derivation of this test is not available in the literature and is
sketched in its entirety below. Readers not interested in this derivation
should skip to the discussion of the Generalized Multiple-regression
Model which begins on page 71.
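For readers who want to see the computation end to end, the following Python sketch may help. It is an illustration under stated assumptions, not the authors' procedure: the function and variable names are hypothetical, within-group variances and covariances are computed with divisor n, the weights are the group proportions, the weighted mean slope b is taken as the weighted mean covariance divided by the weighted mean pretest variance, and the denominator is taken to be the combined-group variance of the D values as developed in the derivation that follows.

```python
import math
from statistics import fmean


def _moments(x, y):
    """Means, variances and covariance of paired scores (divisor n)."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, var_x, var_y, cov_xy


def regression_projection_t(x_t, y_t, x_c, y_c):
    """t for the Regression Projection Model (hypothetical helper).

    x_t, y_t: pretest and posttest scores of the treatment group.
    x_c, y_c: pretest and posttest scores of the comparison group.
    Returns (t, actual minus estimated mean posttest, degrees of freedom).
    """
    n_t, n_c = len(x_t), len(x_c)
    n = n_t + n_c
    p_t, p_c = n_t / n, n_c / n

    mx_t, my_t, vx_t, vy_t, cxy_t = _moments(x_t, y_t)
    mx_c, my_c, vx_c, vy_c, cxy_c = _moments(x_c, y_c)

    # Comparison-group regression line, projected to the treatment-group mean.
    b_c = cxy_c / vx_c
    k_c = my_c - b_c * mx_c
    y_hat_t = b_c * mx_t + k_c
    diff = my_t - y_hat_t

    # Weighted means of the within-group variances and of the slopes.
    s2_x = p_t * vx_t + p_c * vx_c
    s2_y = p_t * vy_t + p_c * vy_c
    b = (p_t * cxy_t + p_c * cxy_c) / s2_x

    # Assumed denominator: variance of D in the combined group.
    var_d = s2_y + b_c ** 2 * s2_x - 2 * b_c * b * s2_x + p_t * p_c * diff ** 2
    t = p_t * diff * math.sqrt(n - 3) / math.sqrt(var_d)
    return t, diff, n - 3
```

A call such as `regression_projection_t(x_t, y_t, x_c, y_c)` with four equal-length-by-group lists of scores returns the statistic, the projected difference, and N - 3 degrees of freedom. Computing the D values student by student and taking their mean and variance directly should give the same t, which is a convenient check on any hand calculation.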

Significance Test for the Regression Projection Model*

* We are grateful to Paul Horst for the rationale and development of
this test. However, the authors are responsible for the presentation
given here and for any errors it may contain.

Consider first the general situation in which a regression line is
fit to a pretest-posttest score distribution, providing an estimated
posttest score ($\hat{Y}$) for each pretest score ($X$). The equation for
the regression line may be written

$$\hat{Y} = bX + k$$

where

  $b$ = slope of the regression line

  $k$ = Y-intercept of the regression line

Then, for each student, we can define a value

$$D = Y - \hat{Y}$$

which is the difference between his actual posttest score and his
estimated posttest score or, in other words, the distance that his actual
posttest score is above or below the regression line.

Next, consider the Regression Projection Model in which a regression
line is fit to the comparison-group data and then projected through the
treatment-group data (Figure 2). A distance $D$ from this regression
line can be computed for each comparison-group student. A distance $D$
from the same comparison-group regression line can be computed for each
treatment-group student. Because the regression line was fit to the
comparison-group data, the mean of the comparison-group $D$ values
($\bar{D}_c$) will be zero. However, the mean of the treatment-group $D$
values ($\bar{D}_t$) will not be zero unless the mean of the
treatment-group posttest scores falls exactly on the projected regression
line, that is, unless $\bar{Y}_t = \hat{Y}_t$.

The null hypothesis which is tested in the Regression Projection
Model includes three major conditions: (a) students are assigned to
treatment and comparison conditions solely on the basis of their pretest
(either single or composite) scores, (b) posttest on pretest regression
is linear throughout the range of pretest scores, and (c) there is no
treatment effect. If it can be assumed that the first two conditions
are met, and if there is no treatment effect, the regression lines of
the treatment group, the comparison group, and the total group should
all approximately coincide. Deviations of treatment-group posttest
scores from the projected comparison-group regression line would have an
expected mean value of zero under these conditions so that a sizeable
departure from this expectation may indicate a significant treatment
effect. In an experimental situation, we can test whether the observed
mean deviation ($\bar{D}$) is larger than would be expected under the
conditions of the null hypothesis by computing

$$t = \frac{\bar{D}}{s_{\bar{D}}} \tag{1}$$

On page 64, $t$ is expressed as a function of treatment- and
comparison-group statistics. The equation is derived as follows:

First we recall that

$$s_{\bar{D}} = \frac{s_D}{\sqrt{df_D}} \tag{2}$$

Substituting (2) into (1) we may write (1) as

$$t^2 = \frac{\bar{D}^2\,(df_D)}{s_D^2} \tag{3}$$

We can then develop the numerator and denominator of (3) separately:

Numerator

The combined mean of the D values can be expressed in terms of the
mean D values for the two groups (with respect to the comparison-group
regression line) and the proportions of cases in each group:

$$\bar{D} = P_t\bar{D}_t + P_c\bar{D}_c. \tag{4}$$

But since the regression line was fit to the comparison-group data,

$$\bar{D}_c = 0. \tag{5}$$

Substituting (5) into (4)

$$\bar{D} = P_t\bar{D}_t. \tag{6}$$

And since the mean of the D values is equal to the difference between
the means of the observed posttest distribution and the estimated
posttest distribution, we can rewrite (6) as:

$$\bar{D} = P_t(\bar{Y}_t - \hat{Y}_t). \tag{7}$$




The remaining factor in the numerator of (3) is $df_D$, the number of
degrees of freedom for the standard deviation of D. Usually $df_D$ is
taken to be $N - 1$ where $N$ is the number of pairs of observations.
However, two additional restrictions hold in this model. First, the
comparison-group D values must sum to zero and second, the mean of the
estimated posttest scores for the treatment group is determined by the
comparison-group data. Therefore

$$df_D = N - 3. \tag{8}$$

By combining (7) and (8), the numerator of (3) can finally be written

$$\bar{D}^2(df_D) = [P_t(\bar{Y}_t - \hat{Y}_t)]^2\,(N - 3). \tag{9}$$
Denominator

It is well known that the variance of a difference between paired
measures is equal to the sum of the variances of the two measures minus
a correction for the correlation between them. In the case of D values
from the Regression Projection Model,

$$s_D^2 = s_Y^2 + s_{\hat{Y}}^2 - 2 r_{Y\hat{Y}}\, s_Y s_{\hat{Y}} \tag{10}$$

where

  $r_{Y\hat{Y}}$ = the correlation between actual and estimated posttest scores

  $s_Y$ = the standard deviation of the actual posttest scores

  $s_{\hat{Y}}$ = the standard deviation of the estimated posttest scores.

Since, by definition,

$$\hat{Y} = b_c X + k_c \tag{11}$$

it can be readily shown that

$$s_{\hat{Y}} = b_c s_X, \tag{12}$$

and

$$r_{Y\hat{Y}} = r_{XY} \tag{13}$$

where $r_{XY}$ is the pretest-posttest correlation for the combined group.






Therefore, substituting (12) and (13) in (10)

$$s_D^2 = s_Y^2 + b_c^2 s_X^2 - 2 b_c\, r_{XY}\, s_X s_Y. \tag{14}$$

This form of the denominator could be used for computing $t^2$. However,
since the treatment and comparison groups are normally analyzed separately,
it is desirable to derive $s_D^2$ as a function of the separate group
statistics. We begin by noting that the covariance between X and Y
($c_{XY}$) is defined by

$$c_{XY} = \frac{\Sigma XY}{N} - \bar{X}\bar{Y} = r_{XY}\, s_X s_Y. \tag{15}$$

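But in the Regression Projection Model

$$\Sigma XY = \Sigma X_t Y_t + \Sigma X_c Y_c, \tag{16}$$

$$\Sigma X = \Sigma X_t + \Sigma X_c, \tag{17}$$

and

$$\Sigma Y = \Sigma Y_t + \Sigma Y_c. \tag{18}$$

Also

$$\frac{\Sigma X_t Y_t}{N} = P_t\,\frac{\Sigma X_t Y_t}{N_t} \tag{19}$$

and

$$\frac{\Sigma X_c Y_c}{N} = P_c\,\frac{\Sigma X_c Y_c}{N_c} \tag{20}$$

where $P_t$ and $P_c$ are the proportions of treatment and comparison
students, respectively. Similarly

$$\frac{\Sigma X_t}{N} = P_t \bar{X}_t, \tag{21}$$

$$\frac{\Sigma X_c}{N} = P_c \bar{X}_c, \tag{22}$$

$$\frac{\Sigma Y_t}{N} = P_t \bar{Y}_t, \tag{23}$$

$$\frac{\Sigma Y_c}{N} = P_c \bar{Y}_c. \tag{24}$$

Substituting (19) through (24) in (16) through (18) and then the resulting
equations in (15) we have

$$c_{XY} = \left[P_t\frac{\Sigma X_t Y_t}{N_t} + P_c\frac{\Sigma X_c Y_c}{N_c}\right] + \left[-(P_t\bar{X}_t + P_c\bar{X}_c)(P_t\bar{Y}_t + P_c\bar{Y}_c)\right]. \tag{25}$$

Next, we subtract the expression $(P_t\bar{X}_t\bar{Y}_t + P_c\bar{X}_c\bar{Y}_c)$
from the first brackets in (25) and add it to the second to get

$$c_{XY} = \left[P_t\left(\frac{\Sigma X_t Y_t}{N_t} - \bar{X}_t\bar{Y}_t\right) + P_c\left(\frac{\Sigma X_c Y_c}{N_c} - \bar{X}_c\bar{Y}_c\right)\right] + \left[P_t\bar{X}_t\bar{Y}_t + P_c\bar{X}_c\bar{Y}_c - (P_t\bar{X}_t + P_c\bar{X}_c)(P_t\bar{Y}_t + P_c\bar{Y}_c)\right]. \tag{26}$$

But we define

$$c_{X_tY_t} = \frac{\Sigma X_t Y_t}{N_t} - \bar{X}_t\bar{Y}_t \tag{27}$$

$$c_{X_cY_c} = \frac{\Sigma X_c Y_c}{N_c} - \bar{X}_c\bar{Y}_c. \tag{28}$$

Also we have

$$(P_t - P_t^2) = P_t(1 - P_t) = P_t P_c \tag{29}$$

and similarly

$$(P_c - P_c^2) = P_c(1 - P_c) = P_t P_c. \tag{30}$$

Using (27) and (28) in the first brackets of (26), and (29) and (30) in
the second we have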



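$$c_{XY} = P_t c_{X_tY_t} + P_c c_{X_cY_c} + P_t P_c(\bar{X}_t - \bar{X}_c)(\bar{Y}_t - \bar{Y}_c). \tag{31}$$

Let

$$\bar{c}_{XY} = P_t c_{X_tY_t} + P_c c_{X_cY_c} \tag{32}$$

$$d_X = (\bar{X}_t - \bar{X}_c) \tag{33}$$

$$d_Y = (\bar{Y}_t - \bar{Y}_c). \tag{34}$$

Substituting (32), (33), and (34) into (31)

$$c_{XY} = \bar{c}_{XY} + P_t P_c\, d_X d_Y. \tag{35}$$

If Y = X, we have from (35)

$$s_X^2 = \bar{s}_X^2 + P_t P_c\, d_X^2. \tag{36}$$

Similarly, if X = Y

$$s_Y^2 = \bar{s}_Y^2 + P_t P_c\, d_Y^2. \tag{37}$$

Substituting (35), (36), and (37) into (14)

$$s_D^2 = \bar{s}_Y^2 + P_t P_c\, d_Y^2 + b_c^2(\bar{s}_X^2 + P_t P_c\, d_X^2) - 2 b_c(\bar{c}_{XY} + P_t P_c\, d_X d_Y). \tag{38}$$

Rearranging terms

$$s_D^2 = \bar{s}_Y^2 + b_c^2\bar{s}_X^2 - 2 b_c\bar{c}_{XY} + P_t P_c(d_Y^2 - 2 b_c d_X d_Y + b_c^2 d_X^2) \tag{39}$$

$$s_D^2 = \bar{s}_Y^2 + b_c^2\bar{s}_X^2 - 2 b_c\bar{c}_{XY} + P_t P_c(d_Y - b_c d_X)^2. \tag{40}$$

Finally, it can be readily shown that

$$\bar{c}_{XY} = b\,\bar{s}_X^2 \tag{41}$$

and that

$$(d_Y - b_c d_X) = (\bar{Y}_t - \hat{Y}_t). \tag{42}$$

Substituting (41) and (42) in (40)

$$s_D^2 = \bar{s}_Y^2 + b_c^2\bar{s}_X^2 - 2 b_c b\,\bar{s}_X^2 + P_t P_c(\bar{Y}_t - \hat{Y}_t)^2 \tag{43}$$

which is the form of the denominator in the equation for t on page 64.

C. The Generalized Multiple-regression Model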

Where neither of the above models is indicated, it may be possible
to apply a multiple regression model to the data, provided the evaluator
can generate a useful null hypothesis. However, considerable caution
and a thorough grasp of the technical issues involved should be considered
prerequisites for any such effort. In particular, the widespread error
of using regression models to statistically equate fundamentally dissimilar
groups must be avoided. Campbell and Erlebacher (1970) have shown that,
in terms of familiar "true score plus error score" models, conventional
regression models systematically underadjust for the initial differences
between such groups. More basically, it should be noted that the
underlying "true score plus error score" construct is purely hypothetical
and there is little evidence to suggest that it provides a useful basis for
equating dissimilar groups. The behavior of one such group simply does
not tell us much about the behavior of the other.

However, in special circumstances the Generalized Multiple-regression
Model may prove to be applicable. In the simplest case, the first step
in applying the model is to calculate a regression equation for the
pretest-posttest distribution of the combined treatment/comparison group.
The pretest score may be considered the "predictor" variable while the
posttest score is the "criterion" variable. The variable of interest
is the "residual variance," that is, the posttest score variance which
is not predicted by the pretest regression equation.

The second step is to add a "treatment" term as the second predictor
in the regression equation and calculate the residual variance
about the new regression line. In the simplest case, the treatment term
is a dichotomous variable which would be given a value of "1" for each
student in the treatment group, and "0" for each student in the comparison
group. There is, however, no reason why it could not be a continuous
variable reflecting, for example, the hours of treatment exposure.

The last step is to test the significance of the difference between
the residual variance computed from the first prediction equation and
the residual variance computed from the second equation. The addition
of the treatment variable in the second equation amounts to adding a
constant to each treatment-group score. Graphically, the result is to
generate two parallel regression lines passing through the means of the
treatment and comparison groups, respectively. The slope of these lines is
the weighted mean of the slopes of the independent regression lines for the
two groups and will, in general, differ from the combined-group regression
line slope. The significance of the effect is determined by testing the
difference between the residual variances from the two prediction equations.
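The three steps just described can be sketched in Python for the simplest case of one pretest, one posttest, and a 0/1 treatment term. This is a minimal illustration, not the report's own procedure: the function names are hypothetical, and the comparison of the two residual sums of squares is expressed as the usual F ratio for nested regression equations, with 1 and N - 3 degrees of freedom, which is an assumption about how the test would be carried out.

```python
from statistics import fmean


def _sums_of_squares(x, y):
    """Sums of squares and cross-products about the means."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy


def treatment_f_test(x_t, y_t, x_c, y_c):
    """Compare residual variation with and without a 0/1 treatment term.

    Returns (F, error degrees of freedom, common slope, adjusted difference).
    """
    x_t, y_t, x_c, y_c = list(x_t), list(y_t), list(x_c), list(y_c)
    n = len(x_t) + len(x_c)

    # Step 1: posttest regressed on pretest in the combined group.
    sxx, syy, sxy = _sums_of_squares(x_t + x_c, y_t + y_c)
    ss_res_1 = syy - sxy ** 2 / sxx

    # Step 2: add the treatment term. Least squares then yields two parallel
    # lines through the group means with the pooled within-group slope.
    sxx_t, syy_t, sxy_t = _sums_of_squares(x_t, y_t)
    sxx_c, syy_c, sxy_c = _sums_of_squares(x_c, y_c)
    sxx_w, syy_w, sxy_w = sxx_t + sxx_c, syy_t + syy_c, sxy_t + sxy_c
    slope = sxy_w / sxx_w
    ss_res_2 = syy_w - sxy_w ** 2 / sxx_w

    # Step 3: test the reduction in residual variation.
    df_error = n - 3
    f_ratio = (ss_res_1 - ss_res_2) / (ss_res_2 / df_error)

    # Y-axis distance between the two parallel lines.
    adjusted_diff = (fmean(y_t) - fmean(y_c)) - slope * (fmean(x_t) - fmean(x_c))
    return f_ratio, df_error, slope, adjusted_diff
```

The returned slope is the pooled within-group slope, which is why it generally differs from the slope fitted to the combined group in step 1. The adjusted difference carries all the interpretive cautions raised in this section when the groups are fundamentally dissimilar.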

The model is a "multiple" regression model in the sense that any
number of predictors can be incorporated in the regression equation in
addition to pretest and treatment variables (e.g., teacher ratings, SES,
etc.). The model is "general" in the sense that a variety of effects can
be examined singly, additively, and interactively. For example, by
including a "treatment group" times "pretest scores" term it is possible
to test whether treatment and comparison regression line slopes are
significantly different. Finally, by including squared or other power
terms, the shape of the regression line can be tested.

It will probably be recognized that the simple case described above
is the Analysis of Covariance Model, a familiar special case of the
Generalized Multiple-regression Model. The Y-axis distance between the two
regression lines is the adjusted posttest difference. As indicated above,
this difference will be a biased estimate if the groups are representative
of distinct populations. A significant effect would provide a convincing
(negative) answer to the question "Were the two groups of posttest scores
drawn randomly from a single population?" However, such a conclusion
is trivial if it were known in advance that the groups were fundamentally
different. Similarly, it is important in all applications of regression
models to state the null hypothesis precisely, and to consider whether
its rejection will be of any interest. Where there is any confusion
concerning the assumptions of the null hypothesis or the implications
of those assumptions, regression models cannot be recommended.






APPENDIX D

Hazards Associated with the Use of Percentiles and
Grade-equivalent Scores



An important part of the development of commercial achievement tests
is the collection of normative data from a large and usually nationally
representative sample of students. These normative data permit the
conversion of raw test scores into various types of "derived" scores (e.g.,
percentiles, stanines, grade equivalents) which provide useful frames of
reference for interpretation. A percentile score, for example, provides
an index of an individual pupil's status with respect to his age or
grade-level peers. A grade-equivalent score is intended to equate an
individual's raw score with the national average level of performance at
some grade level.

Since all of these derived scores are based on national averages,
it is essential that the sample of pupils tested be truly representative
of the national population. It is also clear that the sample must be
large enough so that random sampling errors are small and one can be
confident that the statistics computed from the sample are very close to
those which would have been obtained had the entire population been
tested.

The importance of these sampling considerations is well known and
amply documented (e.g., Angoff, 1971). Unfortunately, even if good
normative data are collected by a test publisher there is no guarantee that
the data will not be misused, misinterpreted, or both. In fact, the
conventions adopted by test publishers in manipulating and reporting
their normative data seem likely to enhance the probability of making
various types of errors. It is these errors which are addressed here
rather than the sampling considerations referred to above.



The normative data for many widely used commercial tests are collected
during one short interval of the school year, usually either fall
(e.g., Iowa Tests of Basic Skills, 1968 ed.), mid-year (e.g., California
Achievement Test, 1970 ed.), or spring (e.g., SRA Achievement Series,
1971 ed.). While a few tests have empirical normative data points in both
fall and spring (e.g., Gates-MacGinitie Reading Tests, 1964 ed.; Stanford
Achievement Tests, 1973 ed.), it is a common practice to generate derived
scores through interpolation and extrapolation processes for times where
no empirical data were collected.

If a test publisher were to collect normative data from nationally
representative samples of children at all grade levels in the seventh
month of the school year, it would be possible to construct tables for
the seventh month of each grade level which enabled raw scores to be
converted to their percentile equivalents. The raw score at the median of
each grade-level distribution could also be appropriately converted to a
grade-equivalent score. The median raw score of the first graders would
thus correspond to a grade equivalent of 1.7, the median score of the
second graders would correspond to a grade equivalent of 2.7, and so on.
Both the percentiles and the grade-equivalent scores determined in this
manner could be called empirical derived scores.

Clearly, if children are tested at the same time in the school year
as the normative data were collected, it is possible to determine their
percentile status with respect to the national sample. However, when
children are tested at times which deviate from the empirical normative
data points it is no longer possible to interpret percentile conversions
meaningfully. It cannot be determined, for example, whether a child in
the second month of second grade who scores at the fortieth percentile
of children in the seventh month of second grade is above or below
average with respect to his grade-level peers. Similarly, it is not possible
to determine a grade equivalent for any raw score which does not
correspond to the empirically determined median for grades 1.7, 2.7, etc.,
except by resorting to interpolation.

It is a relatively simple matter to generate additional
grade-equivalent scores and percentile distributions by interpolating between
empirical data points or by extrapolating beyond them. The assumptions
underlying such projected derived scores, unfortunately, are tenuous at best
and may be significantly in error. Before discussing projected scores,
however, it is useful to point out that even more serious errors can
result from the failure to interpolate or extrapolate. The problem here
is peculiar to percentile scores.

Most test publishers provide percentile norms for both the beginning
and the end of the school year. Many also provide mid-year norms.
It is either inferred or made explicit that the fall norms are "good"
for September, October, and November; that the mid-year norms are good
for December, January, and February; and that the spring norms should
be used for testing dates in March, April, May, and possibly even June.
The tables which present such norms enable one to convert test scores
to percentiles or, conversely, to determine the test score which could,
presumably, be obtained by children at any particular percentile position
with respect to their grade-level peers.

Figure 3 was constructed from the norms tables provided by the
Iowa Tests of Basic Skills, Form 5, Level 12. The solid line in Figure
3 shows the number of items which the test publisher says will be
answered correctly by the median sixth-grade child at various times
during the school year. It implies that all cognitive growth which takes
place during sixth grade occurs overnight on November 30th and
February 28th. The hypothesis that growth occurs in this manner is certainly
untenable.

A more believable expectation for the cognitive growth of average
sixth graders is shown by the broken curve which crosses the line
representing the test publisher's "fiftieth percentile child" at
mid-October, mid-January, and mid-April. If this line is taken to be a
reasonable representation of the "real" median sixth grader, then
comparison with the test publisher's "hypothetical" median sixth grader
will show the real child below average at the beginning of each norming
period and above average at the end of each period.

The amount of time-related distortion inherent in the norms is
shown in Figure 4 where raw scores at the beginning and end of each
normative period were taken from the broken line in Figure 3 and
converted to percentiles using the test publisher's tables. In assessing
the progress of an individual student or the effect of a special
instructional treatment, it is readily apparent that one would get results from
pretesting early in a normative period and posttesting late which would
differ dramatically from the results which would be obtained from the
combination of late pretesting and early posttesting.

[Figure 3. Cognitive growth shown by the test publisher's median versus a
more realistic expectation. Vertical axis: test score (20 to 40).
Horizontal axis: school year months (1 to 9). Lines: publisher's median
(solid), "real" median (broken).]

Where percentile norms are presented for the beginning, middle, and
end of each school year, it seems highly likely that they are "correct"
at some point in time within each of the three-month, nominal norm
intervals. Those points in time, however, are unknown except in cases where
empirical data have been collected. Where norms have been generated
through interpolation and extrapolation, it is probably safe to assume
that the correct point is somewhere near the middle of the interval.
However, any particular point which is chosen may be sufficiently in error
to distort the findings of an evaluation study.

[Figure 4. Publisher's percentile corresponding to the "real" median in
Figure 3 at the beginning and end of each norming period. Horizontal
axis: school year months.]

The same kind of problem exists with respect to grade-equivalent
scores. These scores are usually derived as follows: (a) median raw score
values are identified for each grade level at the month the test was
normed (e.g., 1.7, 2.7, 3.7, etc.) and equated to these grade equivalents,
(b) the interval between medians is divided into ten equal parts, and
(c) the intermediate grade-equivalent scores are equated with the nearest
integral raw score value. The assumption which underlies this procedure,
of course, is that the number of items answered correctly is a linear
function of time over the nine months of the school year and that a third as
much gain is made during each of the three summer months. This is
essentially the same assumption which underlies projected percentile norms.
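Steps (a) through (c) amount to linear interpolation between successive medians. The Python sketch below illustrates the idea with hypothetical median raw scores; the function name and the numbers are assumptions made only for illustration and do not come from any publisher's tables. It also assumes that the empirical medians are exactly one grade apart.

```python
def interpolate_grade_equivalents(medians):
    """Build a grade-equivalent table by linear interpolation.

    medians: dict mapping an empirical grade equivalent (for example 1.7)
             to the median raw score observed at that point.
    Returns a dict mapping each tenth of a grade to the nearest integral
    raw score. Note that Python's round() sends exact halves to the
    nearest even integer.
    """
    points = sorted(medians.items())
    table = {}
    for (grade_0, raw_0), (grade_1, raw_1) in zip(points, points[1:]):
        # (b) divide the interval between medians into ten equal parts
        for step in range(10):
            grade = round(grade_0 + step / 10, 1)
            raw = raw_0 + (raw_1 - raw_0) * step / 10
            # (c) equate the grade equivalent with the nearest integral raw score
            table[grade] = round(raw)
    last_grade, last_raw = points[-1]
    table[round(last_grade, 1)] = round(last_raw)
    return table


# (a) hypothetical median raw scores at the month the test was normed
hypothetical_medians = {1.7: 20, 2.7: 30, 3.7: 36}
ge_table = interpolate_grade_equivalents(hypothetical_medians)
```

Every intermediate entry in such a table rests on the linear growth assumption, so only the entries at 1.7, 2.7, and 3.7 in this sketch would correspond to empirical data.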

A number of studies have been undertaken to investigate the validity
of the linear growth assumption, with perhaps the greatest amount of
attention focused on the summer period where it appears most questionable.
Findings have not always been consistent with respect to the direction
of deviations between empirical and projected data points, but it is quite
clear that such deviations are the rule, rather than the exception.
Wrightstone, Hogan, and Abbott in a recent publication (undated) of the
Test Department, Harcourt Brace Jovanovich, Inc., concluded, "Interpolated
points may be considered as reasonably good estimates of the actual norms
line if empirically determined points had been available for all times
in the year. They are, however, almost certainly in error by some small
amount in most cases and by a substantial amount in some cases [p. 5]."

Beggs and Hieronymus (1968) found different patterns of gains and
losses with respect to the linear growth expectation on different subtests
of the Iowa Tests of Basic Skills. They observed consistent and
substantial summer losses in language and arithmetic areas but not in reading.
Other deviations were noted but they were not consistent from grade to
grade or even at different achievement levels within grade. They reported
some evidence of accelerated growth from mid-January to mid-April in the
language, work-study, and arithmetic areas.

Mousley (1973), using the Stanford Achievement Test (1964 ed.), found
that children showed neither gains nor losses from June of their
third-grade year to the following September in either vocabulary or reading
comprehension. Thomas (1975) reported similar findings from a study
conducted in the San Jose, California school district, but Heyns (1975)
reported reading achievement losses over the summer for blacks and low SES




Some of the most interesting data can be found in the technical
manuals of the test publishers, particularly where tests have been
normed twice during the school year and where both percentiles and
grade-equivalent scores are presented. The issue of interest in these
instances is that the fiftieth percentile child is not always at grade
level! On the Metropolitan Achievement Tests (1970 ed.), for example,
the median third grader is two months below grade level in reading at the
end of the school year. Similarly, the median fourth grader is two
months ahead of grade level in math at the end of the school year.

These anomalies result from a combination of two factors: (a) the
conventions employed by test publishers in developing derived scores
and (b) the fact that cognitive growth is not a linear function of time.
It is standard practice, for example, to provide a single table
converting raw scores to grade equivalents for each level of a test. To
do so, of course, requires that the median child achieve a higher raw
score at each successive point in time. A loss of raw score points over
the summer would produce the interesting situation where a single score
would correspond to three different grade equivalents. Figure 5
illustrates precisely this phenomenon.

The data plotted in Figure 5 are taken from the Norms Booklets,
Form A, of the Stanford Achievement Test (Harcourt Brace Jovanovich,
Inc., 1973). The data points connected by the solid lines represent
the scaled scores in Mathematics Computation of the median child at
grade levels 3.1, 3.8, 4.1, 4.8, 5.1, and 5.8 (raw scores had to be
converted to scaled scores since the data were drawn from three levels of
the test). The points connected by the broken line are scaled scores
achieved by children scoring at grade level at the same points in time.

If the solid line in Figure 5 were used to convert scaled scores
to grade equivalents, it can be seen that a score of 146 would convert
to both 3.7 and 4.1. A scaled score of 147 would correspond to three
different grade equivalents.





To avoid the confusion that might result from using a grade norms
line such as the solid line in Figure 5, test publishers have adopted
the convention of constructing a smoothed line to convert raw or scaled
scores to grade equivalents. Such a smoothed line, of course, gives the
mistaken impression that learning is a more orderly phenomenon than it
really is and introduces distortions of sufficient magnitude to obscure
whatever effects might result from any educational intervention. From
the data reflected in Figure 5, for example, it can be shown that the
third grader who scored exactly at the national average on both pretest
and posttest would achieve grade-equivalent scores of 3.1 and 4.3
respectively and would appear to have made a twelve-month gain in the
seven-month period between the testings.

The example presented in Figure 5 is extreme, and other examples
could be presented where the empirical data points correspond precisely
with the projected points. Examples could also be presented where the
distortion resulting from interpolation or extrapolation is in the
opposite direction from that in the given example.

It should be clear from the above that projected grade-equivalent
scores (and projected percentiles which reflect the same types of
distortion) may deviate substantially from what they seem to be. Such
scores will often not represent the median level of performance of
children at the corresponding grade level. Furthermore, it can be
shown that errors as large as several months are not uncommon.

Despite these problems, if it could be demonstrated that the errors
in grade-equivalent scores were random with respect to the amount and
direction of the distortion introduced, then it might still be possible
to draw valid inferences regarding the effectiveness of educational
programs under certain circumstances. Where such programs had been
evaluated using several different test instruments at several different
grade levels, for example, it might be safe to assume that the errors
cancelled each other out and that mean grade-equivalent gains calculated
across all pupils would be unbiased.

It is not possible, at the present time, to determine whether or not
use of grade-equivalent scores to evaluate educational programs
introduces systematic bias. To do so would require a demonstration that the
gains made by median children (the national norm) were consistently
non-linear over the ten-month school year. If the average gains per
month were greater during that portion of the school year between
fall and spring than between spring and fall, fall-to-spring
grade-equivalent gains would be systematically inflated. Similarly, they
would be systematically too low if the opposite pattern of gains
prevailed.

The evidence cited above which found losses over the summer, or gains
which were less than would be predicted under the linear growth
assumption, tends to support the hypothesis that grade-equivalent gains will be
spuriously high from a fall pretest to a spring posttest. The findings,
however, were not consistent, with the possible exceptions of language
and arithmetic. Certainly the research literature is not definitive on
this issue with respect to reading.

Again, the normative data contained in the manuals accompanying
tests with both fall and spring standardizations are relevant. They
too, however, reveal an inconsistent pattern. The fiftieth percentile
Total Reading score on the Metropolitan Achievement Tests (1970 ed.)
is at grade level at the beginning of each grade and typically somewhat
below grade level at the end of each grade. This pattern would result
in grade-equivalent gain measures which systematically underestimated
real cognitive growth.

Reading Comprehension scores on the Stanford Achievement Tests,
Form A (1973 ed.), show exactly the opposite pattern. At every grade from
first through eighth, the median fall score is below the grade norm line
(grade level) and the median spring score is above it. Consequently, all
fall-to-spring, grade-equivalent gains will be spuriously high.

A somewhat more consistent pattern can be observed in the test scores
of children achieving below the national average. To illustrate this
point, grade-equivalent scores on a variety of reading tests were drawn
from the publishers' manuals for the 22nd percentile child. (This
particular level was chosen because it is thought to be about the average
for the ESEA Title I population.) Scores were collected for six
instruments in all, at both fall and spring data points from grade 1.7 through
6.7. Grade-equivalent gains were computed for the fall-to-spring (school
year) and spring-to-fall (summer) time intervals for each test. These
gains were then divided by the number of school-year months in the
interval to yield the average number of grade-equivalent months gained per
school-year month.
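The gain index described here is a simple ratio, and a small Python sketch may make the arithmetic concrete. The function name is hypothetical, and the sketch assumes the usual convention that one grade-equivalent unit spans ten grade-equivalent months. The example figures are the 3.1 and 4.3 grade equivalents and the seven-month testing interval mentioned earlier in this appendix.

```python
def monthly_ge_gain(ge_start, ge_end, school_months):
    """Grade-equivalent months gained per school-year month.

    ge_start, ge_end: grade-equivalent scores at the two testings.
    school_months:    number of school-year months between the testings.
    """
    ge_months_gained = (ge_end - ge_start) * 10  # one grade = ten GE months
    return ge_months_gained / school_months


# A pupil moving from 3.1 to 4.3 over a seven-month interval shows about
# twelve grade-equivalent months of gain, well over one per school month.
example_rate = monthly_ge_gain(3.1, 4.3, 7)
```

A rate of 1.0 means one grade-equivalent month gained per school-year month; rates above or below 1.0 show how far the derived scores depart from the linear growth that the conversion tables assume.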

Table 1 summarizes the gain data for the three tests which had
empirical data points in both fall and spring (Gates-MacGinitie, 1964 ed.;
Metropolitan Achievement Tests, 1970 ed.; and Stanford Achievement Tests,
1973 ed.). The scales represented are Total Reading for the MAT and SAT
and Reading Comprehension for the Gates-MacGinitie (which does not
provide Total Reading scores). Averages calculated across grades and summers
are presented for each test, and means calculated across tests are
presented for each school year and each summer. The data labeled Annual
Expectation are the mean monthly gains for each test over the entire period
from the end of first grade to the end of sixth grade.

                              TABLE 1

             Monthly Grade-equivalent Gains in Reading
             at the 22nd Percentile on Tests with Two
                  Empirical Normative Data Points

Time Period             Gates     Metro     Stanford     Mean

First Grade
  Summer                 .00       .50        .33         .28
Second Grade            1.00       .83       1.00         .94
  Summer                 .00       .50       -.33         .07
Third Grade             1.00       .33       1.00         .78
  Summer                 .33       .75        .33         .47
Fourth Grade            1.00       .83        .86         .90
  Summer                1.00       .75        .00         .58
Fifth Grade             1.00      1.17        .93        1.03
  Summer                1.00       .25       1.17         .81
Sixth Grade              .71       .83        .57         .70

Average Grade            .91       .80        .87         .87
Average Summer           .47       .55        .30         .44
Annual Expectation       .70       .70        .70         .70

The most significant finding reflected in Table 1 is that, on the 
average, the monthly gain during the school year is almost exactly twice 
that which occurs over the summer. A child who maintains his status over 
the ten school-month period will average .87 months of grade-equivalent 
gain per school-year month from fall to spring and .44 months per month 
from spring to fall. 

The same kind of analyses were carried out with three tests which have 
only one empirical data point per year, the California Achievement Test 
(1970 ed.), the Iowa Tests of Basic Skills (1971 ed.), and the SRA 
Achievement Tests (1971 ed.). The results of these analyses are pre- 
sented in Table 2. It is interesting to note, in that table, that school- 
year gains are only about 30% higher than summer gains for these tests 
rather than the 100% that was observed with those tests normed twice a year. 
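
The contrast between the two sets of tests can be checked directly from 
the Mean columns of Tables 1 and 2. The short sketch below does so; the 
dictionary and function names are illustrative assumptions, and only the 
four mean monthly gains are taken from the tables. 

```python
# Mean monthly gains across tests, as printed in Tables 1 and 2.
table1 = {"school_year": 0.87, "summer": 0.44}  # two empirical data points
table2 = {"school_year": 0.91, "summer": 0.69}  # one empirical data point


def percent_higher(gains):
    """Percent by which the school-year rate exceeds the summer rate."""
    return (gains["school_year"] / gains["summer"] - 1) * 100


pct_two_points = percent_higher(table1)
pct_one_point = percent_higher(table2)

print(f"twice-normed tests: {pct_two_points:.0f}% higher")
print(f"once-normed tests:  {pct_one_point:.0f}% higher")
```

The first figure comes out close to 100% and the second close to 30%, 
which is the difference the text sets out to interpret. 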

In attempting to interpret this difference, it is important to note 
that the basic raw-score-to-grade-equivalent conversion is probably not 
significantly more accurate for the double-normed tests than for those with 
only one empirical data point. The Metropolitan Achievement Tests' inter- 
polated grade-equivalent scores, in fact, were derived entirely from the 
fall data points in exactly the same manner as has generally been employed 
by test publishers when only one data point was available. The practice 
followed with the Stanford Tests was somewhat better but it, too, involved 
curve fitting and smoothing operations which clearly introduced some dis- 
tortions. 

Since the difference between the patterns of gains on the two sets of 
tests cannot be adequately explained in terms of the conversion tables, 




                              TABLE 2 

            Monthly Grade-equivalent Gains in Reading 
              at the 22nd Percentile on Tests with 
               One Empirical Normative Data Point 

Time Period             California     Iowa      SRA      Mean 

First Grade 
  Summer                   1.25         .38      .25       .63 
Second Grade               1.17         .95     1.33      1.15 
  Summer                    .75         .15      .75       .55 
Third Grade                 .67        1.03     1.17       .96 
  Summer                    .50         .63      .75       .63 
Fourth Grade                .67         .92      .67       .75 
  Summer                   1.00         .88      .50       .79 
Fifth Grade                 .83        1.00      .83       .89 
  Summer                    .75        1.00      .75       .83 
Sixth Grade                 .50         .83     1.00       .78 

Average Grade               .77         .95     1.00       .91 
Average Summer              .85         .61      .60       .69 
Annual Expectation          .80         .81      .82       .81 






it has to result from the presence of empirical raw score distributions 
both fall and spring for one set of tests and not for the other. Where 
tests have only a fall or a spring empirical data point, the score dis- 
tributions at the other period must be estimated by interpolation. The 
data in Tables 1 and 2 suggest rather strongly that the interpolation 
procedures used substantially overestimated gains from spring to fall and 
underestimated gains from fall to spring. 

For 22nd percentile children who maintain their status with respect 
to their grade-level peers, Table 1 presents the grade-equivalent gains 
they can be expected to make on the Gates-MacGinitie, Metropolitan, and 
Stanford Achievement tests since the gains shown are all empirically de- 
termined. The gains shown in Table 2, on the other hand, are not empiri- 
cally determined except over full-year periods. It is possible, however, 
to estimate how the average 22nd percentile child would score on the tests 
represented in Table 2 if he showed the same relative growth rates from 
fall to spring and spring to fall that were derived from the tests in 
Table 1. Such a child would have to gain 8 grade-equivalent months over 
the school year (Expectation from Table 2) while growing twice as fast 
from fall to spring as from spring to fall (Mean growth rates from Table 1). 

If one assumes mid-October and mid-April testing dates, then the 22nd 
percentile child would, on the average, show a month-for-month gain from 
fall to spring (six months) and half-a-month-per-month gain from spring to 
fall (four months) when tested with the tests normed only once a year. 
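
The arithmetic behind this estimate can be made explicit. The sketch 
below is a minimal illustration that assumes only the quantities stated 
in the text: an annual gain of 8 grade-equivalent months, a six-month 
fall-to-spring interval, a four-month spring-to-fall interval, and a 
school-year growth rate twice the summer rate. The variable names are 
illustrative. 

```python
annual_gain = 8.0     # grade-equivalent months per year (Table 2 expectation)
school_months = 6     # mid-October to mid-April
summer_months = 4     # mid-April to mid-October, in school-year months
rate_ratio = 2.0      # school-year rate divided by summer rate (Table 1)

# Solve: annual_gain = school_months * r + summer_months * (r / rate_ratio)
school_rate = annual_gain / (school_months + summer_months / rate_ratio)
summer_rate = school_rate / rate_ratio

print(school_rate, summer_rate)  # 1.0 0.5
```

Six months at the full rate account for six grade-equivalent months, and 
four months at half that rate account for the remaining two. 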

The conclusion that a 22nd percentile child would show month-for- 
month gains over the course of the school year while simply maintaining 
his status with respect to his grade-level peers seems intuitively non- 
sensical. It becomes shocking, however, when one considers that month- 
for-month growth is often taken to be the criterion of success in special 
compensatory education projects which supplement regular school experiences. 
To the extent that the analysis presented above is valid, month-for-month 
gains would be expected in the absence of any such special effort. 

The sum total of evidence presented in this appendix, while not en- 
tirely conclusive, suggests rather strongly that the obvious incongruity of 
22nd percentile children making month-for-month gains does not result from 
the analytic step taken to arrive at that expectation but rather from the 
anomalies inherent in projected percentile distributions and grade-equiva- 
lent scores. Such scores appear to reflect both random and systematic 
errors of sufficient magnitude to invalidate any attempt to conduct a norm- 
referenced evaluation. If norm-referenced evaluations are to have any 
credibility whatsoever, they must be based entirely on empirical score dis- 
tributions or projections of no more than a few weeks in either direction 
from such points. 

Additional Problems with Grade-equivalent Scores 

It might be argued that even though grade-equivalent scores systemat- 
ically distort relationships between raw scores and empirically determined 
cognitive growth rates, the distortions are small enough so that they are 
more than counterbalanced by the advantages such scores possess with re- 
spect to simplicity and ease of understanding. The evidence presented 
above should be sufficient to dispel any illusions of this type as far as 
norm-referenced evaluations are concerned. The following discussion is 
intended to show that the apparent simplicity of grade-equivalent scores 
is entirely illusory and, furthermore, that they are scaled in such a way 
as to preclude their treatment with conventional statistical techniques. 

The logical problems with grade-equivalent scores are well covered 
in many of the teachers' guides accompanying commercial tests. Specifi- 
cally, a sixth grader who obtains a grade-equivalent score of four on a 
test is not really like a median fourth grader at all. Similarly, a 
second sixth grader who obtains a grade-equivalent score of eight is not 
like a median eighth grader. All that can be said is that these two sixth 
graders obtained the same scores that median fourth and eighth graders 
would have achieved on the sixth-grade test. Since their experiences, 
training, and intellectual growth rates have been very different from 
the students in higher or lower grades, it is not very meaningful to make 
implicit comparisons between them, particularly since these comparisons 
contain no information as to where the two children stand with respect to 
the achievement score distribution of their sixth-grade peers. 




The interpretation of grade-equivalent scores is further complicated 
by the common misconception that being a year above or below grade level 
has the same meaning at different grade levels. Examination of the norms 
tables for any standardized achievement test clearly shows that this is 
not true. On the Metropolitan Achievement Tests, for example, a second 
grader who scores a year below grade level in Total Reading at the end 
of the school year is at the fourth percentile of the national distribu- 
tion. A sixth-grade child scoring a year below grade level, however, is 
at the 38th percentile. The two points are separated by almost one-and- 
one-half standard deviations! It is also interesting to note that, ac- 
cording to the same norms tables, no children in first grade or the be- 
ginning of second grade are a year below grade level. 
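
The distance between the two percentile ranks can be expressed in 
standard-deviation units with a short calculation. The sketch below 
assumes, for illustration only, that achievement scores are normally 
distributed, so that a percentile rank can be converted to a standard 
score with the inverse normal distribution; the variable names are 
illustrative. 

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1 (assumed shape)

z_second_grader = std_normal.inv_cdf(0.04)  # 4th percentile
z_sixth_grader = std_normal.inv_cdf(0.38)   # 38th percentile

# Separation between the two children in standard-deviation units.
separation = z_sixth_grader - z_second_grader
print(round(separation, 2))
```

Under the normality assumption the separation falls a little short of 
1.5 standard deviations, in line with the figure quoted above. 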

From a program evaluator's standpoint, the scaling problems are even 
more troublesome than the logical ones. The major difficulty is that 
the overall relation of achievement to school grade is not linear, as 
grade-equivalent scores would imply. The effect of this non-linear re- 
lation is illustrated schematically in Figure 6 for reading. No signifi- 
cance should be placed on the exact shape of the curve or the values in 
the figure. It is simply intended to suggest that the average student 
learns to read fairly well by the time he completes junior high school 
and thereafter makes relatively small gains in reading speed or compre- 
hension (as distinguished from vocabulary). 



The reading skill of the 50th percentile student in each grade, as 
measured on an achievement test, defines the grade-equivalent scores for 



It can easily be seen that, on this hypothetical curve, half the sixth- 
grade reading skill is represented not by a third-grade score, but by 
a second-grade score. Similarly, a fifth grader would be half way between 
third and ninth grade in terms of reading skill, while on a linear scale, 
the half-way point would be sixth grade. 

While a curvilinear relationship between grade and skill level would 
be sufficient to invalidate most mathematical operations performed on 






                              TABLE 3 

          Mean Reading Comprehension Scores for Two 
      Hypothetical Students on the Comprehensive Tests 
                 of Basic Skills (Form R) 

                          Raw Score    Scale Score    Grade Equivalent 

Pretest - Grade 6.1 
  Student A (16%ile)        15.00         396.0             3.70 
  Student B (84%ile)        34.00         573.0             9.20 
  Mean                      24.50         484.5             6.45 
  Grade Equivalent           5.80          6.08             6.45 
  Error                     -4.9%         -0.3%            +5.7% 

Posttest - Grade 6.75 
  Student A (16%ile)        17.00         415.0             4.10 
  Student B (84%ile)        35.50         592.5             9.75 
  Mean                      26.25         503.8             6.92 
  Grade Equivalent           6.38          6.73             6.92 
  Error                     -5.5%         -0.3%            +2.5% 

Gain - Grade 6.1 to 6.75 
  Student A (16%ile)          --            --              0.40 
  Student B (84%ile)          --            --              0.55 
  Mean                        .58           .65              .47 
  Error                    -10.8%          0.0%           -27.7% 






grade-equivalent scores, there is some evidence that actual learning 
curves are considerably more irregular, and that curves for faster and 
slower learners are not necessarily the same shape as those for average 
learners. In general, averaging badly scaled grade-equivalent scores 
for students of different ability levels precludes any precise interpre- 
tation of group performance. 

Table 3 presents an example of what can happen when scores on a non- 
equal interval scale are averaged. Two hypothetical students were chosen 
to represent one standard deviation below the mean and one standard deviation 
above the mean, respectively, on the Comprehensive Tests of Basic Skills 
(Form R) Reading Comprehension Scale. Normative data from grades 6.1 and 
6.75 were arbitrarily selected. In this case, using the gain computed from 
standard scores as the "correct" gain, the mean grade-equivalent score 
underestimates the true gain by nearly two months. While the selected 
example is probably not typical of the effect, averaging a group of grade- 
equivalent scores will almost always yield a result which is substantially 
different from that which would be obtained by averaging the correspon- 
ding standard scores and then converting the mean standard score to a 
grade equivalent. 
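
The two ways of computing the group gain can be compared with a few 
lines of arithmetic. The sketch below uses only the grade-equivalent 
figures shown in Table 3; the variable and function names are 
illustrative, and the grade equivalents of the mean scale scores are 
simply read from the table rather than derived from a conversion table. 

```python
# Individual grade equivalents from Table 3 (in years).
pre_ge = {"A": 3.70, "B": 9.20}
post_ge = {"A": 4.10, "B": 9.75}

# Grade equivalents of the MEAN scale score, as read from Table 3.
pre_ge_of_mean_scale = 6.08
post_ge_of_mean_scale = 6.73


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Method 1: average the grade equivalents, then take the difference.
gain_from_mean_ge = mean(post_ge.values()) - mean(pre_ge.values())

# Method 2: average the scale scores, convert, then take the difference.
gain_from_scale = post_ge_of_mean_scale - pre_ge_of_mean_scale

# Shortfall of method 1, in grade-equivalent months (0.1 year = 1 month).
shortfall_months = (gain_from_scale - gain_from_mean_ge) * 10

print(round(gain_from_mean_ge, 3), round(gain_from_scale, 2))
print(round(shortfall_months, 2))
```

Averaging the grade equivalents gives a gain of a little under half a 
year, while the gain based on the mean scale score is .65 of a year, a 
shortfall of nearly two months. 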






VI. 2E?£S£SC£5 



Airsslja, P. 9.y £ Hidecs, I?. F« Criterloa referettf^ed tJtstisz la the 

aeot: la Sgaeatloa , 1972, 2L 1-^- 

Aagoff, V. a. Scales, ootms, £ad eqairalesC scores* Is IborcdUie 
(Ed.), f^tacatlosal acasarepeot* Wt^lagtoo, p. C.c ^?»»yfr*n 
€otsT'^*l ca Sdtx:«cloa, 1971. 

5e^s, P.,!*., 4 Sleroajacs, A. X. Csl&nftltj of gcDtfth la the bseic 
flS^lls tfarcvgl^^^ the school yesr sz^d dcrlas tbe passer. Jottrpal 
of Educatloagl Measorrymt, 1963, 5^ C2), 91-97. 

Cftfipbeli, D. T., & Erlebacher, A. £. Hcv rc^ressloa ertlftcts la qts&si- 
experlseatal eralc&tloas caa clstaliealy salie cosspeasatory edocacloa 
lool: hars£td« la J* Eellmth (Ed.), Dlsadraatased child . Vol. 3. 
Ccsficasatary edocatloa s A cattoaal debate. Sew TcrJt: Brcncer/ 
Mszcl, 1970. 

Canpibell, D. 7., & Staolej, J. C. Ea^rlxseatal acd ^tsast-^experlseatal 
deslgss £oT research oa ceachiag. Is 27. t. Gage (Ed.), Eaadl>ool: of 
regearcb oa teachlag . Chicago: Icaod HcKallj, 1963. (Alfto pc£>llsbed 
a« Expcrlneatal aod guagl-experlaeatal dealgns for research . €3ilcago 
Eaad McSally, 1966.) 

Pavis, F. 3. Criterion refereaced iseasureaept. £SIC Clesrlaghoose on 
Tests, HeasureiBeat, & Evaluatloa. Prlacetos, S.J.: Edocatloaal 
Testlag Service, 1972, ( Eeport 12). 

Davis, F. 3. Crlterloa referenced aeastrreaeat. £EIC Clearlaghocse oa 
Tests, Msastxreseat, & Evaluatloa. Frlacetoa, S.J.: Edocatloaal 
Testlag Service, 1973, (TM Seport 17). 

Claser, R., & Klaus, D. J, Proficiency saasureseat: Assesslog hucaa 
perforsaace. Ia £. H. Gagae (Ed.), Psychological prladples of 
systea developaeat. Kev York: Holt, Siaehart, & ttlascoa, 1962. 

Glaser, R., & Ifltko, A. J. Measuresent la learalag ard lastructloa. 
Ia R. L. Ihomdlke (Ed*), Educatloaal «easiire»eat. Uashiagtoa, 
D. C.: Aaerlcaa CocsqcH oa Edtscatloa, 1971. 

Cull ford, J. P. Fuadaaeatal statistics la psychology aad educatloa* 
(4th ed.) Kev York: HcCraw-Hlll, 1965. 

Barcourt Brace Jovaaovlch, Inc. Staafbrd Achlevesent Test, Ksaual 
Part II, Hor«s booklet, For» a. Kev York: 1973. 

Heyns, B. Exposure and the effects of schoolloa . Berkeley, Ckllfsrola: 
University of California. Technical Eeport uader KIE Grant Ho. 
30713, 1975. 



93 



101 



fiprst, ?• ^rircholcgicel »e«»areafeat ..^ad prfedlcttea > Selsoat, Cill£>taia: 
:£sd9iff>rth, i96S. 

Sorst* P- EfSfcct of treataeat m a ypgclal case of ^eger«?f red Kqltlplc 
r<are>»lo3> E^gese, Oregoa: Orcigoa Se^e^rch lastltcte, 1974 (031 
Technical Xeport Vol. U» Ko« 2). 

Jscksoa. JU PcTcloplog crlterloa rcSereoced testg, DLIC Clearlcghou»e 
oa Tests. Ntastzreaeat* & Erzlostloa* Frloceton, S.J.s Sdacsdoasl 
Testlss Service, 1971 (IM Jiejwrt I). 

LeviDC, S*, & Acgoff, V- H- The effects of practice and grovth oo 
•eoreg the Scholaatlc Aptltute Test, rrlficeton, Educa* 
tioaal Testios Srrvice, February 1958 CK £ad S& So. 55-6/52*58-6). 

t>ord» T» Km Elcaestary xodels £}r Kastzriog chaage. la C. V. Barria 

{gd.> , yrdbless ia aeasariag cSiagge , Mcdisoa, VLsconsin: Uaiversity 
of Idscoasia Press, 1957. 

Hacsley, V. Testlag the "stoser learolag losflT aigtseat. Fhi Delta Kappaa , 
1973, 54, 7^5. 

Farsoas, H. H« Ifcat bappeced at Havthorae? Scleace, 1974, 193, 922-^32* 

Popihas, V« Jw, & Busek,' T. £• Icpllcatloas of crlterloa-re&reaced 

aesscreaest* la tf« J* Fophaa (£d«), Crlterloa- re fererced aeastire- 
sea£« Epglevood Cliffs, H« J«: Edocatioaal Techsolcgy Ftbllshers, 
1971. 

Porter, A. C. The effects of usiat fallible variables ia the aaalysis 
of covariaace^ U8pt2>lisbed doctoral dissertatioa, Uaiversity of 
Vlscoasla, 1967. (Uaiversity HlcrofHas, Aaa Arbor, Hlchigan, 1968} • 

Saret sigr, C* The 0£0 P. C* experiaeot aad the Joha Heary effect* Phi 
Delta Kappaa. 1972, 53, 579-581. 

Staaley, J. C. Reliability. Ia !L L. Ihoradikc (Ed.) , Educational 
geasPTcaeat. (2ad ed.) Sbshiogtoa, D. C.s Aieericaa Cotiacil on 
Educatica, 1971. 

Stfeea, J.-ii. The crperl»eata3 " ggressioa desiga — Aa Inquiry iato 
feasf^ility of coaraodoa _ eatneat allocatioa . Coptijli Aed 
doctoral dissertatioa, Hcrtbvestero Doiversity, 1971. 

Thomas, K. A. Cogaitive grovth over the sa—er aad effects of howes 
OB schools . Califoroia, Eaod Corporatloa, 1974 0K-^25-SI£)* 

tfsitehead, T. H. The iadustrial vorker. Vol. 1. Cambridge, Ksssachu^ 
'/* setts: Earvard tfoiversity Press, 1938. 



94 



WLseTv B. J- Statistical orlociples la expertoeatal desljco - (2&d cd.) 
Kev Torkr IfcCrsw-^Ul, 1971« 

VrlgbtstMC, J- W,, Epgtt, T. P., i Abl>ott» Accouatablllty In 

edacatloa aod associated ce&sureggat groblesg . Test Service 
Soteboolc 33- Kes# Yorks fiarcoart Brace Joraaovlch, Isc^ (usdated) . 



I 



95 

103 



